Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-24 Thread shane knapp
tomorrow i will continue the purge.  :)

On Thu, Jan 24, 2019 at 6:13 PM Sean Owen  wrote:

> No, and we could retire 2.2 now too, but it wouldn't hurt to keep it a bit
> longer in case we have to make a critical release even though it's EOL.
>
> On Thu, Jan 24, 2019, 7:05 PM shane knapp 
>> s/job/jobs
>>
>> these are for the spark-(master|branch-X)-docs builds, so right now i am
>> talking about removing 6 builds for the following branches:
>>
>> 1.6
>> 2.0
>> 2.1
>> 2.3
>> 2.4
>> master
>>
>> in fact, do we even need ANY builds for 1.6, 2.0 and 2.1?
>>
>> On Thu, Jan 24, 2019 at 5:57 PM Sean Owen  wrote:
>>
>>> I think we can just remove this job.
>>>
>>> On Thu, Jan 24, 2019 at 6:44 PM shane knapp  wrote:
>>> >
>>> > On Sun, Jan 13, 2019 at 11:22 AM Felix Cheung <
>>> felixcheun...@hotmail.com> wrote:
>>> >>
>>> >> Eh, yeah, like the one with signing, I think the doc build is mostly
>>> useful a) right before we do a release or during RC resets; b) when
>>> someone makes a huge change to the docs and wants to check it
>>> >>
>>> >> Not sure we need this nightly?
>>> >>
>>> > ohai!  i found the thread!  :)
>>> >
>>> > (see the other emails i sent today, i have currently disabled all of
>>> these branch-based nightly doc builds)
>>> >
>>> > anyways, my thoughts:
>>> >
>>> > we almost *certainly* do not need this to be run nightly...  if at
>>> all.  i am highly dubious of the relative usefulness of these builds.
>>> >
>>> > if someone makes a massive number of changes to the spark site,
>>> they can just manually run the doc build (via 'do-release-docker.sh
>>> -n -s docs') and then check things out locally from the spark/docs/_site
>>> directory[1][2].
>>> >
>>> > another option would be to have ruby and jekyll installed on their dev
>>> machine (or a vm or whatever) and just run 'PRODUCTION=1
>>> RELEASE_VERSION="$SPARK_VERSION" jekyll build' from the spark/docs subdir
>>> (with the new site appearing in spark/docs/_site)[2][3].
>>> >
>>> > thoughts?
>>> >
>>> > shane
>>> >
>>> > [1]  i'm not sure if that dir will be easily accessible outside of the
>>> spark-rm docker container, but i can probably check this out tomorrow.
>>> > [2]  this will absolutely need to be documented somewhere (or
>>> somewheres).
>>> > [3]  this is my preferred solution.
>>> > --
>>> > Shane Knapp
>>> > UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> > https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-24 Thread shane knapp
s/job/jobs

these are for the spark-(master|branch-X)-docs builds, so right now i am
talking about removing 6 builds for the following branches:

1.6
2.0
2.1
2.3
2.4
master

in fact, do we even need ANY builds for 1.6, 2.0 and 2.1?

On Thu, Jan 24, 2019 at 5:57 PM Sean Owen  wrote:

> I think we can just remove this job.
>
> On Thu, Jan 24, 2019 at 6:44 PM shane knapp  wrote:
> >
> > On Sun, Jan 13, 2019 at 11:22 AM Felix Cheung 
> wrote:
> >>
> >> Eh, yeah, like the one with signing, I think the doc build is mostly useful
> a) right before we do a release or during RC resets; b) when someone
> makes a huge change to the docs and wants to check it
> >>
> >> Not sure we need this nightly?
> >>
> > ohai!  i found the thread!  :)
> >
> > (see the other emails i sent today, i have currently disabled all of
> these branch-based nightly doc builds)
> >
> > anyways, my thoughts:
> >
> > we almost *certainly* do not need this to be run nightly...  if at all.
> i am highly dubious of the relative usefulness of these builds.
> >
> > if someone makes a massive number of changes to the spark site,
> they can just manually run the doc build (via 'do-release-docker.sh
> -n -s docs') and then check things out locally from the spark/docs/_site
> directory[1][2].
> >
> > another option would be to have ruby and jekyll installed on their dev
> machine (or a vm or whatever) and just run 'PRODUCTION=1
> RELEASE_VERSION="$SPARK_VERSION" jekyll build' from the spark/docs subdir
> (with the new site appearing in spark/docs/_site)[2][3].
> >
> > thoughts?
> >
> > shane
> >
> > [1]  i'm not sure if that dir will be easily accessible outside of the
> spark-rm docker container, but i can probably check this out tomorrow.
> > [2]  this will absolutely need to be documented somewhere (or
> somewheres).
> > [3]  this is my preferred solution.
> > --
> > Shane Knapp
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-24 Thread Sean Owen
No, and we could retire 2.2 now too, but it wouldn't hurt to keep it a bit
longer in case we have to make a critical release even though it's EOL.

On Thu, Jan 24, 2019, 7:05 PM shane knapp  wrote:

> s/job/jobs
>
> these are for the spark-(master|branch-X)-docs builds, so right now i am
> talking about removing 6 builds for the following branches:
>
> 1.6
> 2.0
> 2.1
> 2.3
> 2.4
> master
>
> in fact, do we even need ANY builds for 1.6, 2.0 and 2.1?
>
> On Thu, Jan 24, 2019 at 5:57 PM Sean Owen  wrote:
>
>> I think we can just remove this job.
>>
>> On Thu, Jan 24, 2019 at 6:44 PM shane knapp  wrote:
>> >
>> > On Sun, Jan 13, 2019 at 11:22 AM Felix Cheung <
>> felixcheun...@hotmail.com> wrote:
>> >>
>> >> Eh, yeah, like the one with signing, I think the doc build is mostly
>> useful a) right before we do a release or during RC resets; b) when
>> someone makes a huge change to the docs and wants to check it
>> >>
>> >> Not sure we need this nightly?
>> >>
>> > ohai!  i found the thread!  :)
>> >
>> > (see the other emails i sent today, i have currently disabled all of
>> these branch-based nightly doc builds)
>> >
>> > anyways, my thoughts:
>> >
>> > we almost *certainly* do not need this to be run nightly...  if at
>> all.  i am highly dubious of the relative usefulness of these builds.
>> >
>> > if someone makes a massive number of changes to the spark site,
>> they can just manually run the doc build (via 'do-release-docker.sh
>> -n -s docs') and then check things out locally from the spark/docs/_site
>> directory[1][2].
>> >
>> > another option would be to have ruby and jekyll installed on their dev
>> machine (or a vm or whatever) and just run 'PRODUCTION=1
>> RELEASE_VERSION="$SPARK_VERSION" jekyll build' from the spark/docs subdir
>> (with the new site appearing in spark/docs/_site)[2][3].
>> >
>> > thoughts?
>> >
>> > shane
>> >
>> > [1]  i'm not sure if that dir will be easily accessible outside of the
>> spark-rm docker container, but i can probably check this out tomorrow.
>> > [2]  this will absolutely need to be documented somewhere (or
>> somewheres).
>> > [3]  this is my preferred solution.
>> > --
>> > Shane Knapp
>> > UC Berkeley EECS Research / RISELab Staff Technical Lead
>> > https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-24 Thread Sean Owen
I think we can just remove this job.

On Thu, Jan 24, 2019 at 6:44 PM shane knapp  wrote:
>
> On Sun, Jan 13, 2019 at 11:22 AM Felix Cheung  
> wrote:
>>
>> Eh, yeah, like the one with signing, I think the doc build is mostly useful
>> a) right before we do a release or during RC resets; b) when someone makes a
>> huge change to the docs and wants to check it
>>
>> Not sure we need this nightly?
>>
> ohai!  i found the thread!  :)
>
> (see the other emails i sent today, i have currently disabled all of these 
> branch-based nightly doc builds)
>
> anyways, my thoughts:
>
> we almost *certainly* do not need this to be run nightly...  if at all.  i am 
> highly dubious of the relative usefulness of these builds.
>
> if someone makes a massive number of changes to the spark site, they can
> just manually run the doc build (via 'do-release-docker.sh -n -s
> docs') and then check things out locally from the spark/docs/_site
> directory[1][2].
>
> another option would be to have ruby and jekyll installed on their dev machine
> (or a vm or whatever) and just run 'PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION"
> jekyll build' from the spark/docs subdir (with the new site appearing in
> spark/docs/_site)[2][3].
>
> thoughts?
>
> shane
>
> [1]  i'm not sure if that dir will be easily accessible outside of the 
> spark-rm docker container, but i can probably check this out tomorrow.
> [2]  this will absolutely need to be documented somewhere (or somewheres).
> [3]  this is my preferred solution.
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu




Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-24 Thread shane knapp
On Sun, Jan 13, 2019 at 11:22 AM Felix Cheung 
wrote:

> Eh, yeah, like the one with signing, I think the doc build is mostly useful
> a) right before we do a release or during RC resets; b) when someone
> makes a huge change to the docs and wants to check it
>
> Not sure we need this nightly?
>
ohai!  i found the thread!  :)

(see the other emails i sent today, i have currently disabled all of these
branch-based nightly doc builds)

anyways, my thoughts:

we almost *certainly* do not need this to be run nightly...  if at all.  i
am highly dubious of the relative usefulness of these builds.

if someone makes a massive number of changes to the spark site, they
can just manually run the doc build (via 'do-release-docker.sh -n -s
docs') and then check things out locally from the spark/docs/_site
directory[1][2].

another option would be to have ruby and jekyll installed on their dev machine
(or a vm or whatever) and just run 'PRODUCTION=1
RELEASE_VERSION="$SPARK_VERSION" jekyll build' from the spark/docs subdir
(with the new site appearing in spark/docs/_site)[2][3].

thoughts?

shane

[1]  i'm not sure if that dir will be easily accessible outside of the
spark-rm docker container, but i can probably check this out tomorrow.
[2]  this will absolutely need to be documented somewhere (or somewheres).
[3]  this is my preferred solution.
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: moving the spark jenkins job builder repo from dbricks --> spark

2019-01-24 Thread shane knapp
looking here:
https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_24_10_34-69dab94-docs/

and here:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-docs/5312/console

this does confirm that these artifacts are indeed created by the packaging
docs builds.

i will disable them manually on jenkins, and make a note not to create them
going forward when moving the JJB configs.

thanks sean!

shane



On Thu, Jan 24, 2019 at 4:48 PM Sean Owen  wrote:

> Are these docs builds creating the SNAPSHOT docs builds at
> https://dist.apache.org/repos/dist/dev/spark/ ? I think from a thread
> last month, these aren't used and should probably just be stopped.
>
> On Thu, Jan 24, 2019 at 3:34 PM shane knapp  wrote:
> >
> > revisiting this thread from october...  sorry for the delay in getting
> around to this until now, but the jenkins job builder configs (and
> associated apache credentials stored in there) are *directly* related to
> the work i'm doing here:
> > https://issues.apache.org/jira/browse/SPARK-26565
> > https://github.com/apache/spark/pull/23492
> >
> > anyways, for each branch, we currently have three packaging builds (
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/):  docs,
> maven snapshot and release.
> >
> > i'm currently working on the release builds to test the release process
> w/o pushing artifacts (see above issue/PR).
> >
> > the maven snapshot builds are green, and working as intended (and use
> the ASF creds).
> >
> > my question is:  are we currently relying on any of these doc builds?
> >
> > thanks in advance,
> >
> > shane
> >
> > On Wed, Oct 17, 2018 at 10:48 AM shane knapp 
> wrote:
> >>
> >> On Wed, Oct 17, 2018 at 10:25 AM Yin Huai  wrote:
> >>>
> >>> Shane, Thank you for initiating this work! Can we do an audit of
> jenkins users and trim down the list?
> >>>
> >> re pruning external (spark-specific) users w/shell and jenkins login
> access:  we can absolutely do this.
> >>
> >> limiting logins for EECS students/faculty/staff is possible, but i will
> need to do some experiments.  we're using SSSD to manage our LDAP logins,
> and it is supposed to handle group filtering but i haven't had much luck
> actually getting it working.
> >>
> >>>
> >>> Also, for packaging jobs, those branch snapshot jobs are active (for
> example,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> for publishing snapshot builds from master branch). They still need
> credentials. After we remove the encrypted credential file, are we planning
> to use jenkins as the single place to manage those credentials and we just
> refer to them in jenkins job config?
> >>>
> >> well, since the creds in the repo are actually encrypted, i think that
> keeping them in there is actually fine.  since i wasn't the one who set any
> of this up, however, i will defer to josh about this.
> >>
> >> shane
> >>
> >>>
> >>> On Wed, Oct 10, 2018 at 12:06 PM shane knapp 
> wrote:
> >
> > Not sure if that's what you meant; but it should be ok for the
> jenkins
> > servers to manually sync with master after you (or someone else) have
> > verified the changes. That should prevent inadvertent breakages since
> > I don't expect it to be easy to test those scripts without access to
> > some test jenkins server.
> >
>  JJB has some built-in lint and testing, so that'll be the first step
> in verifying the build configs.
> 
>  i still have a dream where i have a fully functioning jenkins staging
> deployment...  one day i will make that happen.  :)
> 
>  shane
> 
>  --
>  Shane Knapp
>  UC Berkeley EECS Research / RISELab Staff Technical Lead
>  https://rise.cs.berkeley.edu
> >>
> >>
> >>
> >> --
> >> Shane Knapp
> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
> >> https://rise.cs.berkeley.edu
> >
> >
> >
> > --
> > Shane Knapp
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: moving the spark jenkins job builder repo from dbricks --> spark

2019-01-24 Thread Sean Owen
Are these docs builds creating the SNAPSHOT docs builds at
https://dist.apache.org/repos/dist/dev/spark/ ? I think from a thread
last month, these aren't used and should probably just be stopped.

On Thu, Jan 24, 2019 at 3:34 PM shane knapp  wrote:
>
> revisiting this thread from october...  sorry for the delay in getting around 
> to this until now, but the jenkins job builder configs (and associated apache 
> credentials stored in there) are *directly* related to the work i'm doing 
> here:
> https://issues.apache.org/jira/browse/SPARK-26565
> https://github.com/apache/spark/pull/23492
>
> anyways, for each branch, we currently have three packaging builds 
> (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/):  docs, 
> maven snapshot and release.
>
> i'm currently working on the release builds to test the release process w/o 
> pushing artifacts (see above issue/PR).
>
> the maven snapshot builds are green, and working as intended (and use the ASF 
> creds).
>
> my question is:  are we currently relying on any of these doc builds?
>
> thanks in advance,
>
> shane
>
> On Wed, Oct 17, 2018 at 10:48 AM shane knapp  wrote:
>>
>> On Wed, Oct 17, 2018 at 10:25 AM Yin Huai  wrote:
>>>
>>> Shane, Thank you for initiating this work! Can we do an audit of jenkins 
>>> users and trim down the list?
>>>
>> re pruning external (spark-specific) users w/shell and jenkins login access: 
>>  we can absolutely do this.
>>
>> limiting logins for EECS students/faculty/staff is possible, but i will need 
>> to do some experiments.  we're using SSSD to manage our LDAP logins, and it 
>> is supposed to handle group filtering but i haven't had much luck actually 
>> getting it working.
>>
>>>
>>> Also, for packaging jobs, those branch snapshot jobs are active (for 
>>> example, 
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>  for publishing snapshot builds from master branch). They still need 
>>> credentials. After we remove the encrypted credential file, are we planning 
>>> to use jenkins as the single place to manage those credentials and we just 
>>> refer to them in jenkins job config?
>>>
>> well, since the creds in the repo are actually encrypted, i think that 
>> keeping them in there is actually fine.  since i wasn't the one who set any 
>> of this up, however, i will defer to josh about this.
>>
>> shane
>>
>>>
>>> On Wed, Oct 10, 2018 at 12:06 PM shane knapp  wrote:
>
> Not sure if that's what you meant; but it should be ok for the jenkins
> servers to manually sync with master after you (or someone else) have
> verified the changes. That should prevent inadvertent breakages since
> I don't expect it to be easy to test those scripts without access to
> some test jenkins server.
>
 JJB has some built-in lint and testing, so that'll be the first step in 
 verifying the build configs.

 i still have a dream where i have a fully functioning jenkins staging 
 deployment...  one day i will make that happen.  :)

 shane

 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu




Re: DSv2 question

2019-01-24 Thread Jungtaek Lim
I guess explaining the rationale would be better for understanding the
situation.

It's related to skipping the conversion of params to lowercase before
assigning them to Kafka parameters (https://github.com/apache/spark/pull/23612).
If we guarantee lowercase keys on the interface(s) we can simply pass them to
Kafka as well, and if not we may want to convert them to lowercase to ensure
safety.
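
For illustration, a minimal sketch of that safety conversion (the helper name
is made up here, not taken from the PR):

import java.util.Locale

// hypothetical helper: normalise option keys before handing them to Kafka,
// in case the interface does not guarantee lowercase keys
def lowerCaseKeys(params: Map[String, String]): Map[String, String] =
  params.map { case (k, v) => (k.toLowerCase(Locale.ROOT), v) }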

On Fri, Jan 25, 2019 at 3:27 AM Joseph Torres  wrote:

> I wouldn't be opposed to also documenting that we canonicalize the keys as
> lowercase, but case-insensitivity is, I think, the primary property. It's
> important to call out that data source developers don't have to worry about
> a semantic difference between option("mykey", "value") and option("myKey",
> "value").
>
> On Thu, Jan 24, 2019 at 9:58 AM Gabor Somogyi 
> wrote:
>
>> Hi All,
>>
>> Given org.apache.spark.sql.sources.v2.DataSourceOptions which states the
>> following:
>>
>> * An immutable string-to-string map in which keys are case-insensitive. This 
>> is used to represent
>> * data source options.
>>
>> Case-insensitivity can be achieved in many ways. The implementation provides
>> a lowercasing solution.
>>
>> I've seen code that takes advantage of this implementation detail.
>> My questions are:
>>
>> 1. As the class only states case-insensitive, is the lowercasing subject
>> to change?
>> 2. If it's not subject to change, wouldn't it be better to change
>> case-insensitive to lowercase or something similar?
>>
>> I've seen a similar pattern on interfaces...
>>
>> Thanks in advance!
>>
>> BR,
>> G
>>
>>


Re: moving the spark jenkins job builder repo from dbricks --> spark

2019-01-24 Thread shane knapp
revisiting this thread from october...  sorry for the delay in getting
around to this until now, but the jenkins job builder configs (and
associated apache credentials stored in there) are *directly* related to
the work i'm doing here:
https://issues.apache.org/jira/browse/SPARK-26565
https://github.com/apache/spark/pull/23492

anyways, for each branch, we currently have three packaging builds (
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/):  docs,
maven snapshot and release.

i'm currently working on the release builds to test the release process w/o
pushing artifacts (see above issue/PR).

the maven snapshot builds are green, and working as intended (and use the
ASF creds).

my question is:  are we currently relying on any of these doc builds?

thanks in advance,

shane

On Wed, Oct 17, 2018 at 10:48 AM shane knapp  wrote:

> On Wed, Oct 17, 2018 at 10:25 AM Yin Huai  wrote:
>
>> Shane, Thank you for initiating this work! Can we do an audit of jenkins
>> users and trim down the list?
>>
>> re pruning external (spark-specific) users w/shell and jenkins login
> access:  we can absolutely do this.
>
> limiting logins for EECS students/faculty/staff is possible, but i will
> need to do some experiments.  we're using SSSD to manage our LDAP logins,
> and it is supposed to handle group filtering but i haven't had much luck
> actually getting it working.
>
>
>> Also, for packaging jobs, those branch snapshot jobs are active (for
>> example,
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>> for publishing snapshot builds from master branch). They still need
>> credentials. After we remove the encrypted credential file, are we planning
>> to use jenkins as the single place to manage those credentials and we just
>> refer to them in jenkins job config?
>>
>> well, since the creds in the repo are actually encrypted, i think that
> keeping them in there is actually fine.  since i wasn't the one who set any
> of this up, however, i will defer to josh about this.
>
> shane
>
>
>> On Wed, Oct 10, 2018 at 12:06 PM shane knapp  wrote:
>>
>>> Not sure if that's what you meant; but it should be ok for the jenkins
 servers to manually sync with master after you (or someone else) have
 verified the changes. That should prevent inadvertent breakages since
 I don't expect it to be easy to test those scripts without access to
 some test jenkins server.

 JJB has some built-in lint and testing, so that'll be the first step in
>>> verifying the build configs.
>>>
>>> i still have a dream where i have a fully functioning jenkins staging
>>> deployment...  one day i will make that happen.  :)
>>>
>>> shane
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: DSv2 question

2019-01-24 Thread Joseph Torres
I wouldn't be opposed to also documenting that we canonicalize the keys as
lowercase, but case-insensitivity is, I think, the primary property. It's
important to call out that data source developers don't have to worry about
a semantic difference between option("mykey", "value") and option("myKey",
"value").

On Thu, Jan 24, 2019 at 9:58 AM Gabor Somogyi 
wrote:

> Hi All,
>
> Given org.apache.spark.sql.sources.v2.DataSourceOptions which states the
> following:
>
> * An immutable string-to-string map in which keys are case-insensitive. This 
> is used to represent
> * data source options.
>
> Case-insensitivity can be achieved in many ways. The implementation provides
> a lowercasing solution.
>
> I've seen code that takes advantage of this implementation detail.
> My questions are:
>
> 1. As the class only states case-insensitive, is the lowercasing subject to
> change?
> 2. If it's not subject to change, wouldn't it be better to change
> case-insensitive to lowercase or something similar?
>
> I've seen a similar pattern on interfaces...
>
> Thanks in advance!
>
> BR,
> G
>
>


DSv2 question

2019-01-24 Thread Gabor Somogyi
Hi All,

Given org.apache.spark.sql.sources.v2.DataSourceOptions which states the
following:

* An immutable string-to-string map in which keys are case-insensitive. This is
* used to represent data source options.

Case-insensitivity can be achieved in many ways. The implementation provides
a lowercasing solution.

I've seen code that takes advantage of this implementation detail. My
questions are:

1. As the class only states case-insensitive, is the lowercasing subject to
change?
2. If it's not subject to change, wouldn't it be better to change
case-insensitive to lowercase or something similar?

I've seen a similar pattern on interfaces...
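
To make the lowercasing behaviour concrete, a quick sketch assuming the Spark
2.4 API (the key and value are placeholders):

import java.util.{HashMap => JHashMap}
import org.apache.spark.sql.sources.v2.DataSourceOptions

val raw = new JHashMap[String, String]()
raw.put("myKey", "value")
val options = new DataSourceOptions(raw)

// lookups are case-insensitive, per the documented contract...
assert(options.get("mykey").get() == "value")
assert(options.get("MYKEY").get() == "value")
// ...and asMap() exposes lower-cased keys, which is the implementation
// detail the questions above are about
assert(options.asMap().containsKey("mykey"))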

Thanks in advance!

BR,
G


Re: Missing SparkR in CRAN

2019-01-24 Thread Felix Cheung
Yes, it was discussed on dev@. We are waiting for the 2.3.3 release to
resubmit.


On Thu, Jan 24, 2019 at 5:33 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I happened to find that SparkR is missing from CRAN. See
> https://cran.r-project.org/web/packages/SparkR/index.html
>
> I remember seeing some threads about this on the spark-dev mailing list a
> long, long time ago, IIRC. Is a fix in progress somewhere, or is it something
> I misunderstood?
>


Missing SparkR in CRAN

2019-01-24 Thread Hyukjin Kwon
Hi all,

I happened to find that SparkR is missing from CRAN. See
https://cran.r-project.org/web/packages/SparkR/index.html

I remember seeing some threads about this on the spark-dev mailing list a
long, long time ago, IIRC. Is a fix in progress somewhere, or is it something
I misunderstood?


Re: Reading compacted Kafka topic is slow

2019-01-24 Thread Gabor Somogyi
Hi Tomas,

I presume the 60 sec window means the trigger interval. Maybe a quick win
could be to try Structured Streaming, because there the trigger interval is
optional: if it is not specified, the system will check for availability of
new data as soon as the previous processing has completed.
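
A minimal sketch of that alternative, assuming the spark-sql-kafka-0-10
connector is on the classpath (servers, topic and paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compacted-topic-sss").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-compacted-topic")
  .load()

// no trigger is set, so a new micro-batch starts as soon as the previous one
// finishes, instead of waiting out a fixed interval
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/compacted-topic-checkpoint")
  .start()

query.awaitTermination()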

BR,
G


On Thu, Jan 24, 2019 at 12:55 PM Tomas Bartalos 
wrote:

> Hello Spark folks,
>
> I'm reading a compacted Kafka topic with spark 2.4, using direct stream -
> KafkaUtils.createDirectStream(...). I have configured the necessary options
> for a compacted stream, so it's processed with CompactedKafkaRDDIterator.
> It works well; however, with many gaps in the topic, the processing
> is very slow and the executors are idle 90% of the time.
>
> I had a look at the source and here are my findings:
> Spark first computes the number of records to stream from Kafka (processing
> rate * batch window size). The number of records is translated to a Kafka
> (offset_from, offset_to) range, and eventually the iterator reads records
> within those offset boundaries.
> This works fine until there are many gaps in the topic, which reduces the
> real number of processed records.
> Let's say we wanted to read 100k records in a 60 sec window. With gaps it
> comes to only 10k (because the other 90k are just compacted gaps) in 60 sec.
> As a result the executor works for only 6 sec and does nothing for 54 sec.
> I'd like to utilize the executor as much as possible.
>
> A great feature would be to read 100k real records (skip the gaps) no
> matter what are the offsets.
>
> I've tried to make some improvements with backpressure and my custom
> RateEstimator (decorating PidRateEstimator and boosting the rate per
> second). I was even able to fully utilize the executors, but my approach
> has a big problem when the compacted part of the topic meets the
> non-compacted part: the executor tries to read too big a chunk of Kafka
> and the whole processing dies.
>
> BR,
> Tomas
>


Reading compacted Kafka topic is slow

2019-01-24 Thread Tomas Bartalos
Hello Spark folks,

I'm reading a compacted Kafka topic with spark 2.4, using direct stream -
KafkaUtils.createDirectStream(...). I have configured the necessary options for
a compacted stream, so it's processed with CompactedKafkaRDDIterator.
It works well; however, with many gaps in the topic, the processing is
very slow and the executors are idle 90% of the time.
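
For context, a minimal sketch of this kind of setup with the Spark 2.4 DStream
API (servers, topic, group id and rates are placeholders, not my actual
configuration):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf()
  .setAppName("compacted-topic-reader")
  // needed for compacted topics, whose offset ranges contain gaps
  .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")
  // rate settings that drive how wide each (offset_from, offset_to) range is
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")

val ssc = new StreamingContext(conf, Seconds(60))  // the 60 sec batch window

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "compacted-reader",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-compacted-topic"), kafkaParams))

// with a heavily compacted topic the per-batch count can be far below
// rate * window, which is the idle-executor effect described below
stream.foreachRDD(rdd => println(s"records in this batch: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()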

I had a look at the source and here are my findings:
Spark first computes the number of records to stream from Kafka (processing
rate * batch window size). The number of records is translated to a Kafka
(offset_from, offset_to) range, and eventually the iterator reads records
within those offset boundaries.
This works fine until there are many gaps in the topic, which reduces the
real number of processed records.
Let's say we wanted to read 100k records in a 60 sec window. With gaps it
comes to only 10k (because the other 90k are just compacted gaps) in 60 sec.
As a result the executor works for only 6 sec and does nothing for 54 sec.
I'd like to utilize the executor as much as possible.

A great feature would be to read 100k real records (skip the gaps) no
matter what are the offsets.

I've tried to make some improvements with backpressure and my custom
RateEstimator (decorating PidRateEstimator and boosting the rate per
second). I was even able to fully utilize the executors, but my approach
has a big problem when the compacted part of the topic meets the
non-compacted part: the executor tries to read too big a chunk of Kafka
and the whole processing dies.
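
Roughly, the decorator looks like this (a sketch only: RateEstimator and the
PID estimator are private[streaming], so it has to be compiled into the same
package, and the boost factor here is just an illustration):

package org.apache.spark.streaming.scheduler.rate

class BoostedRateEstimator(underlying: RateEstimator, boost: Double)
    extends RateEstimator {

  // delegate to the wrapped estimator and scale whatever rate it proposes
  override def compute(
      time: Long,
      elements: Long,
      processingDelay: Long,
      schedulingDelay: Long): Option[Double] =
    underlying.compute(time, elements, processingDelay, schedulingDelay)
      .map(_ * boost)
}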

BR,
Tomas