Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-29 Thread Nicholas Chammas
Dunno if I made a silly mistake, but I wanted to bring some attention to
this issue in case there was something serious going on here that might
affect the upcoming release.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25150

Nick


[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584239#comment-16584239
 ] 

Nicholas Chammas edited comment on SPARK-25150 at 8/17/18 6:15 PM:
---

I know there are a bunch of pending bug fixes in 2.3.2. I'm not sure if this is 
covered by any of them, and didn't have time to set up 2.3.2 to see if this 
problem is still present there. I will be away for some time and thought it 
best to report this now in case someone can pick it up and investigate further 
until I get back.

cc [~marmbrus].


was (Author: nchammas):
I know there are a bunch of pending bug fixes in 2.3.2. I'm not sure if this is 
covered by any of them, and didn't have time to set up 2.3.2 to see if this 
problem is still present there.

cc [~marmbrus].

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>    Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584239#comment-16584239
 ] 

Nicholas Chammas commented on SPARK-25150:
--

I know there are a bunch of pending bug fixes in 2.3.2. I'm not sure if this is 
covered by any of them, and didn't have time to set up 2.3.2 to see if this 
problem is still present there.

cc [~marmbrus].

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>    Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.






[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-25150:
-
Attachment: zombie-analysis.py
states.csv
persons.csv
output-without-implicit-cross-join.txt
output-with-implicit-cross-join.txt

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>    Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.






[jira] [Created] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-25150:


 Summary: Joining DataFrames derived from the same source yields 
confusing/incorrect results
 Key: SPARK-25150
 URL: https://issues.apache.org/jira/browse/SPARK-25150
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Nicholas Chammas
 Attachments: output-with-implicit-cross-join.txt, 
output-without-implicit-cross-join.txt, persons.csv, states.csv, 
zombie-analysis.py

I have two DataFrames, A and B. From B, I have derived two additional 
DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
confusing error:
{code:java}
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
{code}
Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark 
appears to give me incorrect answers.

I am not sure if I am missing something obvious, or if there is some kind of 
bug here. The "join condition is missing" error is confusing and doesn't make 
sense to me, and the seemingly incorrect output is concerning.

I've attached a reproduction, along with the output I'm seeing with and without 
the implicit cross join enabled.

I realize the join I've written is not correct in the sense that it should be 
left outer join instead of an inner join (since some of the aggregates are not 
available for all states), but that doesn't explain Spark's behavior.
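To make the shape of the problem concrete, here is a minimal, hypothetical sketch of the join pattern described above. This is not the attached zombie-analysis.py reproduction; the DataFrame names, columns, and data are stand-ins.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A: one DataFrame; B: a second source (hypothetical stand-ins for the
# attached persons.csv / states.csv data).
A = spark.createDataFrame([(1, "RI"), (2, "MA")], ["id", "state"])
B = spark.createDataFrame([(1, "RI", 30), (2, "MA", 40)], ["id", "state", "age"])

# B1 and B2: two aggregates derived from the same source B.
B1 = B.groupBy("state").agg(F.avg("age").alias("avg_age"))
B2 = B.groupBy("state").agg(F.max("age").alias("max_age"))

# Joining A to two DataFrames derived from the same source B is the
# pattern described in this report.
result = A.join(B1, on="state").join(B2, on="state")
result.show()

# The error message suggests this setting, but per the report the results
# that came back afterwards looked wrong:
# spark.conf.set("spark.sql.crossJoin.enabled", "true")
{code}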






Re: [Python-ideas] Toxic forum

2018-08-13 Thread Nicholas Chammas
From what I’ve seen on this list in my relatively brief time here, this
forum is mostly fine and the participants generally behave like adults. I
don’t read every thread, so maybe I don’t have an accurate picture. From
what I’ve seen, there is the occasional spat, where people just need to
step away from the keyboard for a bit and cool off, but I wouldn’t consider
the environment toxic. That seems like too strong a word.

What I feel we lack are better tools for checking bad behavior and nudging
people in the right direction, apart from having discussions like this.

Temporarily locking threads that have become heated comes to mind. As do
temporary bans for rowdy members. As does closing threads that have ceased
to serve any good purpose.

These things would have been pretty handy in several recent and heated
discussions, including the PEP 505 one I participated in a couple of weeks
ago.

How much of this is possible on a mailing list? I don’t know. But I
remember a year or two ago there was a proposal to move python-ideas to a
different format, perhaps a Discourse forum or similar, which would provide
some handy administration tools and other UX improvements.

That discussion, ironically, was beset by some of the very animosity that
better tooling would help control. (And I learned a bit about email
etiquette and top-posting...)

Maybe we need to revive that discussion? Overall, I don’t think we have a
people problem on this list as much as we have an administration tooling
problem.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] File format for automatic and manual tests

2018-08-09 Thread Nicholas Chammas
On Wed, Aug 8, 2018 at 5:09 AM Paul Moore  wrote:

> This strikes me as *absolutely* something that should be promoted
> outside of the stdlib, as a 3rd party project, and once it's
> established as a commonly used and accepted standard, only then
> propose that the stdlib offer support for it (if that's even needed at
> that point).
>
> Trying to promote a standard by making it "official" and then
> encouraging tools to accept it "because it's the official standard"
> seems like it's doing things backwards, to me at least.
>

+1


Re: [Python-ideas] Revisiting dedicated overloadable boolean operators

2018-08-03 Thread Nicholas Chammas
On Fri, Aug 3, 2018 at 1:47 PM Todd toddr...@gmail.com wrote:

> The operators would be:
>
> bNOT - boolean "not"
> bAND - boolean "and"
> bOR - boolean "or"
> bXOR - boolean "xor"
>
These look pretty ugly to me. But that could just be a matter of
familiarity.

For what it’s worth, the Apache Spark project offers a popular DataFrame
API for querying tabular data, similar to Pandas. The project overloaded
the bitwise operators &, |, and ~ since they could not override the boolean
operators and, or, and not.

For example:

non_python_rhode_islanders = (
    person
    .where(~person['is_python_programmer'])
    # Parentheses are needed because & binds more tightly than == and >.
    .where((person['state'] == 'RI') & (person['age'] > 18))
    .select('first_name', 'last_name')
)
non_python_rhode_islanders.show(20)

This did lead to confusion among users, since people (myself
included) would initially try the boolean operators and wonder why they
weren’t working. So the Spark devs added a warning to catch when users were
making this mistake. But now it seems quite OK to me to use &, |, and ~ in
the context of Spark DataFrames, even though their use doesn’t match their
designed meaning. It’s unfortunate, but I think the Spark devs made a
practical choice that works well enough for their users.
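For what it's worth, here is a self-contained sketch of what that guard looks
like in practice. The DataFrame contents are made up, and the exact wording of
the error varies by Spark version:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
person = spark.createDataFrame(
    [("Ada", "Lovelace", "RI", 42, False), ("Bob", "Smith", "MA", 17, True)],
    ["first_name", "last_name", "state", "age", "is_python_programmer"],
)

# Using the boolean operator raises as soon as Python asks the Column for a
# truth value, which is how the mistake gets caught early:
try:
    person.where((person['age'] > 18) and (person['state'] == 'RI'))
except ValueError as err:
    print(err)  # points you to '&', '|', and '~' instead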

PEP 335 would have addressed this issue by letting developers overload the
common boolean operators directly, but from what I gather of Guido’s
rejection, the
biggest problem was that it would have had an undue performance impact on
non-users of boolean operator overloading. (Not sure if I interpreted his
email correctly.)


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-29 Thread Nicholas Chammas
On Sun, Jul 29, 2018 at 10:58 AM Steven D'Aprano 
wrote:

> On Sun, Jul 29, 2018 at 06:32:19AM -0400, David Mertz wrote:
> > On Sun, Jul 29, 2018, 2:00 AM Steven D'Aprano 
> wrote:
> >
> > > Fine. So it takes them an extra day to learn one more operator. Big
> > > deal. It is commonly believed to take ten years to master a field or
> > > language. Amortize that one day over ten years and its virtually
> > > nothing.
> > >
> >
> > This is where being wrong matters. The experience in this thread of most
> > supporters failing to get the semantics right shows that this isn't an
> > extra day to learn.
>
> The difficulty one or two people had in coming up with a correct
> equivalent to none-aware operators on the spur of the moment is simply
> not relevant. Aside from whichever developers implements the feature,
> the rest of us merely *use* it, just as we already use import,
> comprehensions, yield from, operators, class inheritence, and other
> features which are exceedingly difficult to emulate precisely in pure
> Python code.
>
> Even something as simple as the regular dot attribute lookup is
> difficult to emulate precisely. I doubt most people would be able to
> write a pure-Python version of getattr correctly the first time. Or even
> the fifth. I know I wouldn't.
>
> I'm sure that you are fully aware that if this proposal is accepted,
> people will not need to reinvent the wheel by emulating these none-aware
> operators in pure Python, so your repeated argument that (allegedly)
> even the supporters can't implement it correctly is pure FUD. They won't
> have to implement it, that's the point.


+1 to this, though I agree with Raymond's post that perhaps a breather on
language changes would be helpful until some of the recently introduced
features have become more familiar to folks.


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Thu, Jul 26, 2018 at 12:17 AM David Mertz  wrote:

> On Thu, Jul 26, 2018 at 12:00 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Forgive me for being slow. I'm missing what's different in semantics
>> between the translation above and Chris's translation below:
>>
>
> You are VERY close now.  You have more SPAM, so yours is better:
>



Understood. Thanks for clarifying. I think Chris was referring to this in
an earlier message (when I made my first attempt at translating) when he
said:

> Aside from questions of repeated evaluation/assignment, yes. The broad
> semantics should be the same.

So the meaning of my translation matches the "intention" of the original
PEP 505 expression, but I see the danger in offering up such loose
translations. Getting the meaning right but missing important details (like
repeated evaluation) does not inspire confidence.
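To make the repeated-evaluation point concrete, here is a contrived,
self-contained example. The property and counter are made up purely to count
attribute accesses:

from types import SimpleNamespace

class Spam:
    eggs_evaluations = 0

    @property
    def eggs(self):
        type(self).eggs_evaluations += 1
        return SimpleNamespace(bacon="crispy")

spam = Spam()

# My loose translation evaluates spam.eggs twice:
food = None
if spam is not None and spam.eggs is not None:
    food = spam.eggs.bacon
print(Spam.eggs_evaluations)  # 2

# Chris's step-by-step translation evaluates each link at most once, which
# is closer to what spam?.eggs?.bacon is meant to do:
Spam.eggs_evaluations = 0
_tmp = spam
if _tmp is not None:
    _tmp = _tmp.eggs
if _tmp is not None:
    _tmp = _tmp.bacon
food = _tmp
print(Spam.eggs_evaluations)  # 1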


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 11:09 PM David Mertz  wrote:

> On Wed, Jul 25, 2018 at 10:50 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Indeed. Thanks for the counter-example. I think the correct translation
>> is as follows:
>> food = spam?.eggs?.bacon
>> Becomes:
>> food = None
>> if spam is not None and spam.eggs is not None:
>> food = spam.eggs.bacon
>>
> Did I get it right now? :)
>>
>
> Nope, still not right, I'm afraid!
>
> Chris Angelica provided a more accurate translation.
>

Forgive me for being slow. I'm missing what's different in semantics
between the translation above and Chris's translation below:

> _tmp = spam
> if _tmp is not None:
>     _tmp = _tmp.eggs
> if _tmp is not None:
>     _tmp = _tmp.bacon
> food = _tmp
>

What's a case where they would do something different?

* If spam is None, they work the same -> None
* If spam is not None, but spam.eggs exists and is None, they work the same
-> None
* If spam is not None, but spam.eggs doesn't exist, they work the same ->
AttributeError
* If spam is not None, and spam.eggs is not None, but spam.eggs.bacon is
None, they work the same -> None
* If spam is not None, and spam.eggs is not None, but spam.eggs.bacon
doesn't exist, they work the same -> AttributeError
* If spam is not None, and spam.eggs is not None, and spam.eggs.bacon is
not None, they work the same -> bacon
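For concreteness, here is a quick, self-contained check of a few of these
cases, with Chris's translation wrapped in a helper function and
SimpleNamespace objects standing in for real ones:

from types import SimpleNamespace

def maybe_chain(spam):
    # Chris's translation of spam?.eggs?.bacon, as quoted above.
    _tmp = spam
    if _tmp is not None:
        _tmp = _tmp.eggs
    if _tmp is not None:
        _tmp = _tmp.bacon
    return _tmp

print(maybe_chain(None))                                              # None
print(maybe_chain(SimpleNamespace(eggs=None)))                        # None
print(maybe_chain(SimpleNamespace(eggs=SimpleNamespace(bacon="yum"))))  # yum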


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 10:12 PM David Mertz  wrote:

> On Wed, Jul 25, 2018 at 9:47 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> > That is disingenuous, I think.  Can this raise an AttributeError?
>>> > spam?.eggs?.bacon
>>> > Of course it can! And this is exactly the pattern used in many
>>> examples in
>>> > the PEP and the discussion. So the PEP would create a situation where
>>> code
>>> > will raise AttributeError in a slightly—and subtly—different set of
>>> > circumstances than plain attribute access will.
>>>
>>
>
>> food = spam?.eggs?.bacon
>> Can be rewritten as:
>> food = spam
>> if spam is not None and spam.eggs is not None:
>>     food = spam.eggs.bacon
>> They both behave identically, no? Maybe I missed the point David was
>> trying to make.
>>
>
> No, you illustrate it perfectly! I had to stare at your translation for a
> while to decide if it was really identical to the proposed
> `spam?.eggs?.bacon`.  The fact I have to think so hard makes the syntax
> feel non-obvious.
>
> Plus, there's the fact that your best effort at translating the proposed
> syntax is WRONG.  Even a strong proponent cannot explain the behavior on a
> first try.  And indeed, it behaves subtly different from plain attribute
> access in where it raises AttributeError.
>
> >>> from types import SimpleNamespace
> >>> spam = SimpleNamespace()
> >>> spam.eggs = None
> >>> spam.eggs.bacon
> AttributeError: 'NoneType' object has no attribute 'bacon'
>
> >>> # spam?.eggs?.bacon
> >>> # Should be: None
>
> >>> "Translation" does something different
> >>> food = spam
> >>> if spam is not None and spam.eggs is not None:
> ... food = spam.eggs.bacon
> >>> food
> namespace(eggs=None)
>

Indeed. Thanks for the counter-example. I think the correct translation is
as follows:

food = spam?.eggs?.bacon

Becomes:

food = None
if spam is not None and spam.eggs is not None:
    food = spam.eggs.bacon

Did I get it right now? :)

What misled me was this example from the PEP
<https://www.python.org/dev/peps/pep-0505/#the-maybe-dot-and-maybe-subscript-operators>
showing
how atoms are evaluated. The breakdown begins with `_v = a`, so I copied
that pattern incorrectly when trying to explain how an example assignment
would work, instead of translating directly from what I understood the PEP
505 variant should do.

So, shame on me. I think this particular mistake reflects more on me than
on PEP 505, but I see how this kind of mistake reflects badly on the folks
advocating for the PEP (or at least, playing devil's advocate).


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 9:20 PM Chris Angelico  wrote:

> On Thu, Jul 26, 2018 at 11:02 AM, David Mertz  wrote:
> > That is disingenuous, I think.  Can this raise an AttributeError?
> >
> > spam?.eggs?.bacon
> >
> > Of course it can! And this is exactly the pattern used in many examples
> in
> > the PEP and the discussion. So the PEP would create a situation where
> code
> > will raise AttributeError in a slightly—and subtly—different set of
> > circumstances than plain attribute access will.
>
> I don't understand. If it were to raise AttributeError, it would be
> because spam (or spam.eggs) isn't None, but doesn't have an attribute
> eggs (or bacon). Exactly the same as regular attribute access. How is
> it slightly different? Have I missed something?
>

That was my reaction, too.

food = spam?.eggs?.bacon

Can be rewritten as:

food = spam
if spam is not None and spam.eggs is not None:
    food = spam.eggs.bacon

They both behave identically, no? Maybe I missed the point David was trying
to make.


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 6:11 PM Abe Dillon  wrote:

> The problem here is not whether it's explicit. It's about Readability and
> conciseness. Using symbols in place of words almost always harms
> readability in favor of conciseness.
>
> value = person.name if person.name else person
>
> almost reads like english (aside from being a weird and totally uncommon
> use case)
>
> value = person?.name
>
> Is a huge step towards the concise illegible soup of symbols that Perl is
> famous for. It's a huge No from me.
>

The two statements you wrote are not the same. The first statement will
error out if person is None. The proposed None-aware operators are
specifically designed to handle variables that may be None.

The first statement should instead read:

value = person.name if person is not None else person

That's what `value = person?.name` means.

As others have pointed out, I suppose the fact that multiple people have
messed up the meaning of the proposed operators is concerning. Perhaps the
PEP could be improved by adding some dead simple examples of each operator
and an equivalent statement that doesn't use the operator, to better
illustrate their meaning. But I gather that will do little in the way of
addressing some of the stronger objections raised here.
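As a rough illustration of what I mean, something like the following. The
?-forms are the proposed syntax (not valid Python today), and the plain
equivalents shown here gloss over the fact that the real operators would
evaluate their left-hand side only once:

from types import SimpleNamespace

a, b = None, "default"
obj = SimpleNamespace(attr="spam")
d = {"key": "eggs"}

# v = a ?? b          (None-coalescing)
v = a if a is not None else b

# v = obj?.attr       (None-aware attribute access)
v = obj.attr if obj is not None else None

# v = d?["key"]       (None-aware subscript)
v = d["key"] if d is not None else None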


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 6:36 PM David Mertz  wrote:

> The fact that a whole bunch of people have commented on this subthread
> while not recognizing that the semantics of the '?.' and the if blocks are
> entirely different suggests the operators are bug magnets.
>
> On Wed, Jul 25, 2018, 5:17 PM Nicholas Chammas 
> wrote:
>
>> On Mon, Jul 23, 2018 at 6:05 PM Giampaolo Rodola' 
>> wrote:
>>
>>> This:
>>>
>>> v = a?.b
>>>
>>> ...*implicitly* checks if value is not None [and continues execution].
>>> This:
>>>
>>> v = a
>>> if a.b is not None:
>>>     v = a.b
>>>
>>> ...*explicitly* checks if value is not None and continues execution.
>>>
>>
Sorry, lazy reading on my part. I skimmed the expanded form assuming it was
correct. I think it should instead read `if a is not None: ...`.

Is that what you're getting at?


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 12:12 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> On Mon, Jul 23, 2018 at 6:05 PM Giampaolo Rodola' 
> wrote:
>
>> This:
>>
>> v = a?.b
>>
>> ...*implicitly* checks if value is not None [and continues execution].
>> This:
>>
>> v = a
>> if a.b is not None:
>>     v = a.b
>>
>> ...*explicitly* checks if value is not None and continues execution.
>>
>
> I think both of those are equally explicit. It's just that one notation is
> more concise than the other. Explicitness and conciseness are related but
> different things.
>
> 
>

It looks like others already discussed this point later in the thread.
Apologies for rehashing the argument.


Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Mon, Jul 23, 2018 at 6:05 PM Giampaolo Rodola' 
wrote:

> This:
>
> v = a?.b
>
> ...*implicitly* checks if value is not None [and continues execution].
> This:
>
> v = a
> if a.b is not None:
>     v = a.b
>
> ...*explicitly* checks if value is not None and continues execution.
>

I think both of those are equally explicit. It's just that one notation is
more concise than the other. Explicitness and conciseness are related but
different things.

When something is "explicit", as I understand it, that means it does what
it says on the cover. There is no unstated behavior. The plain meaning of
`v = a?.b` is that it expands to the longer form (`v = a; if a.b ...`), and
it is just as explicit.

This reminds me of something I read about once called Stroustrup's Rule [1]:

> For new features, people insist on LOUD explicit syntax.
> For established features, people want terse notation.

I think the "explicit vs. implicit" part of this discussion is probably
better expressed as a discussion about "loud vs. terse" syntax. None of the
operators in PEP 505 have implicit behavior, to the best of my
understanding. It's just that the operators are new and have terse
spellings.

As a point of comparison, I think a good example of implicit behavior is
type coercion. When you ask Python to add an int to a float

a = 3 + 4.5

all that you've explicitly asked for is the addition. However, Python
implicitly converts the 3 from an int to a float as part of the operation.
The type conversion isn't anywhere "on the cover" of the + operator. It's
implicit behavior.

[1] Bjarne Stroustrup makes the observation in this talk at 23:00.


[Distutils] Re: Make an ordered list of sdists to be installed?

2018-07-23 Thread Nicholas Chammas
I don’t know the details, but I did read that Poetry has a sophisticated
dependency resolver.

https://github.com/sdispater/poetry

I don’t know if there is a way to access the resolver independently of the
tool, but perhaps it would provide a handy reference.
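Failing that, the ordering step on its own is just a topological sort over the
dependency graph. A rough sketch, where the dependency map is hand-written
(extracting it from each downloaded sdist's metadata is the hard part you're
describing) and graphlib requires Python 3.9+:

from graphlib import TopologicalSorter

# Hypothetical dependency map (package -> its dependencies). In practice this
# would be extracted from the metadata of each downloaded sdist.
deps = {
    "ipython": {"traitlets", "pygments"},
    "traitlets": {"decorator"},
    "pygments": set(),
    "decorator": set(),
}

# static_order() yields dependencies before the packages that need them,
# i.e. a valid source-install order.
print(list(TopologicalSorter(deps).static_order()))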
On Mon, Jul 23, 2018 at 5:49 AM, Thomas Kluyver wrote:

> Hi all,
>
> Do we know of any tool that can, given the name of one or more packages,
> follow dependency chains and produce a list of packages in the order they
> need to be installed, assuming every package needed will be built from
> source?
>
> Running "pip download --no-binary :all: ipython" gets me a set of sdists
> to be installed, but I lose any information about the order. I assume some
> packages will fail to build if their dependencies are not installed first,
> so the order is significant.
>
> Pip appears to keep track of the ordering internally: if I run "pip
> install --no-binary :all: ipython", all the dependencies are downloaded,
> and then the collected packages are installed starting from those with no
> dependencies and finishing with the package I requested. But I don't know
> of any way to get this information out of pip. Is there an option that I'm
> overlooking? Or some other tool that can do this?
>
> The use case I'm thinking about is to automatically generate instructions
> for a build system which separates the downloading and installing steps, so
> for each step it expects one or more URLs to download, along with
> instructions for how to install that piece. The installation steps
> shouldn't download further data. I could work around the issue by telling
> it to download all the sdists in a single step and then install in one shot
> with --no-index and --find-links. But it's more elegant - and better for
> caching - if we can install each package as a single step.
>
> Thanks,
> Thomas
> --
> Distutils-SIG mailing list -- distutils-sig@python.org
> To unsubscribe send an email to distutils-sig-le...@python.org
> https://mail.python.org/mm3/mailman3/lists/distutils-sig.python.org/
> Message archived at
> https://mail.python.org/mm3/archives/list/distutils-sig@python.org/message/LGTH3IYBMVKBS4PYGFJ6A7N5GW5ZKFUY/
>
--
Distutils-SIG mailing list -- distutils-sig@python.org
To unsubscribe send an email to distutils-sig-le...@python.org
https://mail.python.org/mm3/mailman3/lists/distutils-sig.python.org/
Message archived at 
https://mail.python.org/mm3/archives/list/distutils-sig@python.org/message/BSODWBIYPSOGHUWJRPZ4Y2SZSSU4ZDGG/


Re: Review notification bot

2018-07-22 Thread Nicholas Chammas
On this topic, I just stumbled on a GitHub feature called CODEOWNERS
<https://help.github.com/articles/about-codeowners/>. It lets you specify
owners of specific areas of the repository using the same syntax that
.gitignore uses. Here is CPython's CODEOWNERS file
<https://github.com/python/cpython/blob/master/.github/CODEOWNERS> for
reference.

Dunno if that would complement mention-bot (which Facebook is apparently no
longer maintaining <https://github.com/facebookarchive/mention-bot#readme>),
or if we can even use it given the ASF setup on GitHub. But I thought it
would be worth mentioning nonetheless.

On Sat, Jul 14, 2018 at 11:17 AM Holden Karau  wrote:

> Hearing no objections (and in a shout out to @ Nicholas Chammas who
> initially suggested mention-bot back in 2016) I've set up a copy of mention
> bot and run it against my own repo (looks like
> https://github.com/holdenk/spark-testing-base/pull/253 ).
>
> If no one objects I’ll ask infra to turn this on for Spark on a trial
> basis and we can revisit it based on how folks interact with it.
>
> On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau 
> wrote:
>
>> So there are a few bots along this line in OSS. If no one objects I’ll
>> take a look and find one which matches our use case and try it out.
>>
>> On Wed, Jun 6, 2018 at 10:33 AM Sean Owen  wrote:
>>
>>> Certainly I will frequently dig through 'git blame' to figure out who
>>> might be the right reviewer. Maybe that's automatable -- ping the person
>>> who last touched the most lines touched by the PR? There might be some
>>> false positives there. And I suppose the downside is being pinged forever
>>> for some change that just isn't well considered or one of those accidental
>>> 100K-line PRs. So maybe some way to decline or silence is important, or
>>> maybe just ping once and leave it. Sure, a bot that just adds a "Would @foo
>>> like to review?" comment on Github? Sure seems worth trying if someone is
>>> willing to do the work to cook up the bot.
>>>
>>> On Wed, Jun 6, 2018 at 12:22 PM Holden Karau 
>>> wrote:
>>>
>>>> Hi friends,
>>>>
>>>> Was chatting with some folks at the summit and I was wondering how
>>>> people would feel about adding a review bot to ping folks. We already have
>>>> the review dashboard but I was thinking we could ping folks who were the
>>>> original authors of the code being changed whom might not be in the habit
>>>> of looking at the review dashboard.
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> --
> Twitter: https://twitter.com/holdenkarau
>


[jira] [Resolved] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-27 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved HADOOP-15559.
---
Resolution: Fixed

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
> release of Hadoop that Spark is built against. However, neither [this 
> page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
> pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of 
> {{hadoop-aws}} work with what versions and builds of Spark? It would help 
> eliminate th

[jira] [Resolved] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-27 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved HADOOP-15559.
---
Resolution: Fixed

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
> release of Hadoop that Spark is built against. However, neither [this 
> page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
> pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of 
> {{hadoop-aws}} work with what versions and builds of Spark? It would help 
> eliminate this 

[jira] [Commented] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-27 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525697#comment-16525697
 ] 

Nicholas Chammas commented on HADOOP-15559:
---

Looks good to me. I will consider raising the issue of building with 
{{-Phadoop-cloud}} on the Spark dev list.

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
> release of Hadoop that Spark is built against. However, neither [this 
> page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
> pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropria

[jira] [Comment Edited] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-26 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524469#comment-16524469
 ] 

Nicholas Chammas edited comment on HADOOP-15559 at 6/27/18 2:27 AM:


Hi [~ste...@apache.org] and thank you for the thorough response and references.

1. Is [the s3a troubleshooting 
guide|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md]
 published anywhere? Or is the GitHub URL the canonical URL? I feel like [S3 
Support in Apache Hadoop|https://wiki.apache.org/hadoop/AmazonS3] is the most 
visible bit of documentation about s3a. It would make sense to link to the 
troubleshooting guide from there. 

2. In my case, I am not adding the AWS SDK individually. By using {{pyspark 
--packages}} (or {{spark-submit --packages}}) with hadoop-aws, I understand 
that Spark automatically pulls transitive dependencies for me. So my focus has 
been to just get the mapping of Spark version to hadoop-aws version correct.

Additionally, I am trying really hard to stick to the default release builds of 
Spark, as opposed to building my own versions of Spark to use with 
[Flintrock|https://github.com/nchammas/flintrock]. Being able to spin Spark 
clusters up on EC2 by downloading Spark directly from the Apache mirror network 
means one less piece of infrastructure I have to maintain myself. So I'm trying 
not to get into the business of building Spark, though I am aware of 
{{-Phadoop-cloud}}.

Thankfully, it looks like [Spark 2.3.1 built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/] works with {{--packages 
"org.apache.hadoop:hadoop-aws:2.7.6"}}, and I suppose according to your comment 
in SPARK-22919 that is basically the version of hadoop-aws I need to use with 
these releases as long as Spark is built against Hadoop 2.7.

Does that sound about right to you?


was (Author: nchammas):
Hi [~ste...@apache.org] and thank you for the thorough response and references.
 # Is [the s3a troubleshooting 
guide|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md]
 published anywhere? Or is the GitHub URL the canonical URL? I feel like [S3 
Support in Apache Hadoop|https://wiki.apache.org/hadoop/AmazonS3] is the most 
visible bit of documentation about s3a. It would make sense to link to the 
troubleshooting guide from there.
 # In my case, I am not adding the AWS SDK individually. By using {{pyspark 
--packages}} (or {{spark-submit --packages}}) with hadoop-aws, I understand 
that Spark automatically pulls transitive dependencies for me. So my focus has 
been to just get the mapping of Spark version to hadoop-aws version correct.

Additionally, I am trying really hard to stick to the default release builds of 
Spark, as opposed to building my own versions of Spark to use with 
[Flintrock|https://github.com/nchammas/flintrock]. Being able to spin Spark 
clusters up on EC2 by downloading Spark directly from the Apache mirror network 
means one less piece of infrastructure I have to maintain myself. So I'm trying 
not to get into the business of building Spark, though I am aware of 
{{-Phadoop-cloud}}.

Thankfully, it looks like [Spark 2.3.1 built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/] works with {{--packages 
"org.apache.hadoop:hadoop-aws:2.7.6"}}, and I suppose according to your comment 
in SPARK-22919 that is basically the version of hadoop-aws I need to use with 
these releases as long as Spark is built against Hadoop 2.7.

Does that sound about right to you?

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop li

[jira] [Commented] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-26 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524469#comment-16524469
 ] 

Nicholas Chammas commented on HADOOP-15559:
---

Hi [~ste...@apache.org] and thank you for the thorough response and references.
 # Is [the s3a troubleshooting 
guide|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md]
 published anywhere? Or is the GitHub URL the canonical URL? I feel like [S3 
Support in Apache Hadoop|https://wiki.apache.org/hadoop/AmazonS3] is the most 
visible bit of documentation about s3a. It would make sense to link to the 
troubleshooting guide from there.
 # In my case, I am not adding the AWS SDK individually. By using {{pyspark 
--packages}} (or {{spark-submit --packages}}) with hadoop-aws, I understand 
that Spark automatically pulls transitive dependencies for me. So my focus has 
been to just get the mapping of Spark version to hadoop-aws version correct.

Additionally, I am trying really hard to stick to the default release builds of 
Spark, as opposed to building my own versions of Spark to use with 
[Flintrock|https://github.com/nchammas/flintrock]. Being able to spin Spark 
clusters up on EC2 by downloading Spark directly from the Apache mirror network 
means one less piece of infrastructure I have to maintain myself. So I'm trying 
not to get into the business of building Spark, though I am aware of 
{{-Phadoop-cloud}}.

Thankfully, it looks like [Spark 2.3.1 built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/] works with {{--packages 
"org.apache.hadoop:hadoop-aws:2.7.6"}}, and I suppose according to your comment 
in SPARK-22919 that is basically the version of hadoop-aws I need to use with 
these releases as long as Spark is built against Hadoop 2.7.

Does that sound about right to you?
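
For reference, a minimal sketch of the pairing that seems to work for me. The
bucket and file path are placeholders, and PYSPARK_SUBMIT_ARGS must end with
pyspark-shell when set this way:
{code:python}
import os
from pyspark.sql import SparkSession

# Match hadoop-aws to the Hadoop line the Spark release was built against
# (2.7.x for the default Spark 2.3.1 download), not a newer line like 2.8.x.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.hadoop:hadoop-aws:2.7.6 pyspark-shell"
)

spark = SparkSession.builder.getOrCreate()

# Placeholder path; any s3a:// read exercises the classes that fail with
# IllegalAccessError when the versions are mismatched.
df = spark.read.text("s3a://some-bucket/some-file.txt")
df.show(5)
{code}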

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache

[jira] [Created] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-25 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created HADOOP-15559:
-

 Summary: Clarity on Spark compatibility with hadoop-aws
 Key: HADOOP-15559
 URL: https://issues.apache.org/jira/browse/HADOOP-15559
 Project: Hadoop Common
  Issue Type: Improvement
  Components: documentation, fs/s3
Reporter: Nicholas Chammas


I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
command-line tool for launching Apache Spark clusters on AWS. One of the things 
I try to do for my users is make it straightforward to use Spark with 
{{s3a://}}. I do this by recommending that users start Spark with the 
{{hadoop-aws}} package.

For example:
{code:java}
pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
{code}
I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
work with what versions of Spark.

Spark releases are [built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, I've 
been told that I should be able to use newer versions of Hadoop and Hadoop 
libraries with Spark, so for example, running Spark built against Hadoop 2.7 
alongside HDFS 2.8 should work, and there is [no need to build Spark explicitly 
against Hadoop 
2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].

I'm having trouble translating this mental model into recommendations for how 
to pair Spark with {{hadoop-aws}}.

For example, Spark 2.3.1 built against Hadoop 2.7 works with 
{{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
yields the following error when I try to access files via {{s3a://}}.
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
: java.lang.IllegalAccessError: tried to access method 
org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V 
 from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748){code}
So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
release of Hadoop that Spark is built against. However, neither [this 
page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
one|https://hortonworks.github.io/hdp-aws/s3-spark/] sheds any light on how to 
pair the correct version of {{hadoop-aws}} with Spark.

Would it be appropriate to add some guidance somewhere on what versions of 
{{hadoop-aws}} work with what versions and builds of Spark? It would help 
eliminate this kind of guesswork and slow spelunking.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-02 Thread Nicholas Chammas
I'll give that a try, but I'll still have to figure out what to do if none
of the release builds work with hadoop-aws, since Flintrock deploys Spark
release builds to set up a cluster. Building Spark is slow, so we only do
it if the user specifically requests a Spark version by git hash. (This is
basically how spark-ec2 did things, too.)

On Sat, Jun 2, 2018 at 6:54 PM Marcelo Vanzin  wrote:

> If you're building your own Spark, definitely try the hadoop-cloud
> profile. Then you don't even need to pull anything at runtime,
> everything is already packaged with Spark.
>
> On Fri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas
>  wrote:
> > pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for me
> > either (even building with -Phadoop-2.7). I guess I’ve been relying on an
> > unsupported pattern and will need to figure something else out going
> forward
> > in order to use s3a://.
> >
> >
> > On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin 
> wrote:
> >>
> >> I have personally never tried to include hadoop-aws that way. But at
> >> the very least, I'd try to use the same version of Hadoop as the Spark
> >> build (2.7.3 IIRC). I don't really expect a different version to work,
> >> and if it did in the past it definitely was not by design.
> >>
> >> On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas
> >>  wrote:
> >> > Building with -Phadoop-2.7 didn’t help, and if I remember correctly,
> >> > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0
> release,
> >> > so
> >> > it appears something has changed since then.
> >> >
> >> > I wasn’t familiar with -Phadoop-cloud, but I can try that.
> >> >
> >> > My goal here is simply to confirm that this release of Spark works
> with
> >> > hadoop-aws like past releases did, particularly for Flintrock users
> who
> >> > use
> >> > Spark with S3A.
> >> >
> >> > We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop
> builds
> >> > with
> >> > every Spark release. If the -hadoop2.7 release build won’t work with
> >> > hadoop-aws anymore, are there plans to provide a new build type that
> >> > will?
> >> >
> >> > Apologies if the question is poorly formed. I’m batting a bit outside
> my
> >> > league here. Again, my goal is simply to confirm that I/my users still
> >> > have
> >> > a way to use s3a://. In the past, that way was simply to call pyspark
> >> > --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very
> similar.
> >> > If
> >> > that will no longer work, I’m trying to confirm that the change of
> >> > behavior
> >> > is intentional or acceptable (as a review for the Spark project) and
> >> > figure
> >> > out what I need to change (as due diligence for Flintrock’s users).
> >> >
> >> > Nick
> >> >
> >> >
> >> > On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin 
> >> > wrote:
> >> >>
> >> >> Using the hadoop-aws package is probably going to be a little more
> >> >> complicated than that. The best bet is to use a custom build of Spark
> >> >> that includes it (use -Phadoop-cloud). Otherwise you're probably
> >> >> looking at some nasty dependency issues, especially if you end up
> >> >> mixing different versions of Hadoop.
> >> >>
> >> >> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
> >> >>  wrote:
> >> >> > I was able to successfully launch a Spark cluster on EC2 at 2.3.1
> RC4
> >> >> > using
> >> >> > Flintrock. However, trying to load the hadoop-aws package gave me
> >> >> > some
> >> >> > errors.
> >> >> >
> >> >> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
> >> >> >
> >> >> > 
> >> >> >
> >> >> > :: problems summary ::
> >> >> >  WARNINGS
> >> >> > [NOT FOUND  ]
> >> >> > com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
> >> >> >  local-m2-cache: tried
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
> >> >> >

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for me
either (even building with -Phadoop-2.7). I guess I’ve been relying on an
unsupported pattern and will need to figure something else out going
forward in order to use s3a://.
​

On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin  wrote:

> I have personally never tried to include hadoop-aws that way. But at
> the very least, I'd try to use the same version of Hadoop as the Spark
> build (2.7.3 IIRC). I don't really expect a different version to work,
> and if it did in the past it definitely was not by design.
>
> On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas
>  wrote:
> > Building with -Phadoop-2.7 didn’t help, and if I remember correctly,
> > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release,
> so
> > it appears something has changed since then.
> >
> > I wasn’t familiar with -Phadoop-cloud, but I can try that.
> >
> > My goal here is simply to confirm that this release of Spark works with
> > hadoop-aws like past releases did, particularly for Flintrock users who
> use
> > Spark with S3A.
> >
> > We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop builds
> with
> > every Spark release. If the -hadoop2.7 release build won’t work with
> > hadoop-aws anymore, are there plans to provide a new build type that
> will?
> >
> > Apologies if the question is poorly formed. I’m batting a bit outside my
> > league here. Again, my goal is simply to confirm that I/my users still
> have
> > a way to use s3a://. In the past, that way was simply to call pyspark
> > --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very similar.
> If
> > that will no longer work, I’m trying to confirm that the change of
> behavior
> > is intentional or acceptable (as a review for the Spark project) and
> figure
> > out what I need to change (as due diligence for Flintrock’s users).
> >
> > Nick
> >
> >
> > On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin 
> wrote:
> >>
> >> Using the hadoop-aws package is probably going to be a little more
> >> complicated than that. The best bet is to use a custom build of Spark
> >> that includes it (use -Phadoop-cloud). Otherwise you're probably
> >> looking at some nasty dependency issues, especially if you end up
> >> mixing different versions of Hadoop.
> >>
> >> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
> >>  wrote:
> >> > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4
> >> > using
> >> > Flintrock. However, trying to load the hadoop-aws package gave me some
> >> > errors.
> >> >
> >> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
> >> >
> >> > 
> >> >
> >> > :: problems summary ::
> >> >  WARNINGS
> >> > [NOT FOUND  ]
> >> > com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
> >> >  local-m2-cache: tried
> >> >
> >> >
> >> >
> file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
> >> > [NOT FOUND  ]
> >> > com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
> >> >  local-m2-cache: tried
> >> >
> >> >
> >> >
> file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
> >> > [NOT FOUND  ]
> >> > org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
> >> >  local-m2-cache: tried
> >> >
> >> >
> >> >
> file:/home/ec2-user/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar
> >> > [NOT FOUND  ]
> >> > com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar (0ms)
> >> >  local-m2-cache: tried
> >> >
> >> >
> >> >
> file:/home/ec2-user/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar
> >> >
> >> > I’d guess I’m probably using the wrong version of hadoop-aws, but I
> >> > called
> >> > make-distribution.sh with -Phadoop-2.8 so I’m not sure what else to
> try.
> >> >
> >> > Any quick pointers?
> >> >
> >> > Nick
> >> >
> >> >
> >> > On Fri, Jun 1, 2018 at 6:29 PM Marcelo Vanzin 
> >> > wrote:
> >> >>
> >> >> Starting with my own +1 

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
Building with -Phadoop-2.7 didn’t help, and if I remember correctly,
building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release, so
it appears something has changed since then.

I wasn’t familiar with -Phadoop-cloud, but I can try that.

My goal here is simply to confirm that this release of Spark works with
hadoop-aws like past releases did, particularly for Flintrock users who use
Spark with S3A.

We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop builds
with every Spark release. If the -hadoop2.7 release build won’t work with
hadoop-aws anymore, are there plans to provide a new build type that will?

Apologies if the question is poorly formed. I’m batting a bit outside my
league here. Again, my goal is simply to confirm that I/my users still have
a way to use s3a://. In the past, that way was simply to call pyspark
--packages org.apache.hadoop:hadoop-aws:2.8.4 or something very similar. If
that will no longer work, I’m trying to confirm that the change of behavior
is intentional or acceptable (as a review for the Spark project) and figure
out what I need to change (as due diligence for Flintrock’s users).

Nick
​

On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin  wrote:

> Using the hadoop-aws package is probably going to be a little more
> complicated than that. The best bet is to use a custom build of Spark
> that includes it (use -Phadoop-cloud). Otherwise you're probably
> looking at some nasty dependency issues, especially if you end up
> mixing different versions of Hadoop.
>
> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
>  wrote:
> > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4
> using
> > Flintrock. However, trying to load the hadoop-aws package gave me some
> > errors.
> >
> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
> >
> > 
> >
> > :: problems summary ::
> >  WARNINGS
> > [NOT FOUND  ]
> > com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
> >  local-m2-cache: tried
> >
> >
> file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
> > [NOT FOUND  ]
> > com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
> >  local-m2-cache: tried
> >
> >
> file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
> > [NOT FOUND  ]
> > org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
> >  local-m2-cache: tried
> >
> >
> file:/home/ec2-user/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar
> > [NOT FOUND  ]
> > com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar (0ms)
> >  local-m2-cache: tried
> >
> >
> file:/home/ec2-user/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar
> >
> > I’d guess I’m probably using the wrong version of hadoop-aws, but I
> called
> > make-distribution.sh with -Phadoop-2.8 so I’m not sure what else to try.
> >
> > Any quick pointers?
> >
> > Nick
> >
> >
> > On Fri, Jun 1, 2018 at 6:29 PM Marcelo Vanzin 
> wrote:
> >>
> >> Starting with my own +1 (binding).
> >>
> >> On Fri, Jun 1, 2018 at 3:28 PM, Marcelo Vanzin 
> >> wrote:
> >> > Please vote on releasing the following candidate as Apache Spark
> version
> >> > 2.3.1.
> >> >
> >> > Given that I expect at least a few people to be busy with Spark Summit
> >> > next
> >> > week, I'm taking the liberty of setting an extended voting period. The
> >> > vote
> >> > will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
> >> >
> >> > It passes with a majority of +1 votes, which must include at least 3
> +1
> >> > votes
> >> > from the PMC.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 2.3.1
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see http://spark.apache.org/
> >> >
> >> > The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
> >> > https://github.com/apache/spark/tree/v2.3.1-rc4
> >> >
> >> > The release files, including signatures, digests, etc. can be found
> at:
> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
> >> >
> >> > Signatures used for Spark RCs can be found in this file:
> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >
> >&

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 using
Flintrock . However, trying to load
the hadoop-aws package gave me some errors.

$ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4



:: problems summary ::
 WARNINGS
[NOT FOUND  ]
com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
 local-m2-cache: tried
  
file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
[NOT FOUND  ]
com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
 local-m2-cache: tried
  
file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
[NOT FOUND  ]
org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
 local-m2-cache: tried
  
file:/home/ec2-user/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar
[NOT FOUND  ]
com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar (0ms)
 local-m2-cache: tried
  
file:/home/ec2-user/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar

I’d guess I’m probably using the wrong version of hadoop-aws, but I called
make-distribution.sh with -Phadoop-2.8 so I’m not sure what else to try.

Any quick pointers?

Nick
​

On Fri, Jun 1, 2018 at 6:29 PM Marcelo Vanzin  wrote:

> Starting with my own +1 (binding).
>
> On Fri, Jun 1, 2018 at 3:28 PM, Marcelo Vanzin 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> 2.3.1.
> >
> > Given that I expect at least a few people to be busy with Spark Summit
> next
> > week, I'm taking the liberty of setting an extended voting period. The
> vote
> > will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
> >
> > It passes with a majority of +1 votes, which must include at least 3 +1
> votes
> > from the PMC.
> >
> > [ ] +1 Release this package as Apache Spark 2.3.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
> > https://github.com/apache/spark/tree/v2.3.1-rc4
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1272/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
> >
> > The list of bug fixes going into 2.3.1 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12342432
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with a out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.3.1?
> > ===
> >
> > The current list of open tickets targeted at 2.3.1 can be found at:
> > https://s.apache.org/Q3Uo
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> >
> > --
> > Marcelo
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[issue22269] Resolve distutils option conflicts with priorities

2018-05-13 Thread Nicholas Chammas

Change by Nicholas Chammas <nicholas.cham...@gmail.com>:


--
nosy: +nchammas

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue22269>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



Re: Documenting the various DataFrame/SQL join types

2018-05-08 Thread Nicholas Chammas
OK great, I’m happy to take this on.

Does it make sense to approach this by adding an example for each join type
here
<https://github.com/apache/spark/blob/master/examples/src/main/python/sql/basic.py>
(and perhaps also in the matching areas for Scala, Java, and R), and then
referencing the examples from the SQL Programming Guide
<https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md>
using include_example tags?

e.g.:


{% include_example write_sorting_and_bucketing python/sql/datasource.py %}

And would this let me implement simple tests for the examples? It’s not
clear to me whether the comment blocks in that example file are used for
testing somehow.
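
For concreteness, the Python side might look something like this (just a sketch; the join_types label is made up, and I'm assuming the $example on/off$ comment markers work here the way they appear to in the existing example files):

# In examples/src/main/python/sql/basic.py
# $example on:join_types$
left = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
right = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "other"])
left.join(right, on="id", how="left_semi").show()  # rows of left with a match in right
left.join(right, on="id", how="left_anti").show()  # rows of left with no match in right
# $example off:join_types$

which the guide would then pull in with an include_example tag like the one above.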

Just looking for some high level guidance.

Nick
​

On Tue, May 8, 2018 at 11:42 AM Reynold Xin <r...@databricks.com> wrote:

> Would be great to document. Probably best with examples.
>
> On Tue, May 8, 2018 at 6:13 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> The documentation for DataFrame.join()
>> <https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join>
>> lists all the join types we support:
>>
>>- inner
>>- cross
>>- outer
>>- full
>>- full_outer
>>- left
>>- left_outer
>>- right
>>- right_outer
>>- left_semi
>>- left_anti
>>
>> Some of these join types are also listed on the SQL Programming Guide
>> <http://spark.apache.org/docs/2.3.0/sql-programming-guide.html#supported-hive-features>
>> .
>>
>> Is it obvious to everyone what all these different join types are? For
>> example, I had never heard of a LEFT ANTI join until stumbling on it in the
>> PySpark docs. It’s quite handy! But I had to experiment with it a bit just
>> to understand what it does.
>>
>> I think it would be a good service to our users if we either documented
>> these join types ourselves clearly, or provided a link to an external
>> resource that documented them sufficiently. I’m happy to file a JIRA about
>> this and do the work itself. It would be great if the documentation could
>> be expressed as a series of simple doc tests, but brief prose describing
>> how each join works would still be valuable.
>>
>> Does this seem worthwhile to folks here? And does anyone want to offer
>> guidance on how best to provide this kind of documentation so that it’s
>> easy to find by users, regardless of the language they’re using?
>>
>> Nick
>> ​
>>
>


Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Nicholas Chammas
That’s correct. I probably would have done better to title this thread
something like “How to effectively track and release persisted DataFrames”.

I jumped the gun in my initial email by referencing getPersistentRDDs() as
a potential solution, but in theory the desired API is something like
spark.unpersistAllExcept([list
of DataFrames or RDDs]). This seems awkward, but I suspect the underlying
use case is common.
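
To make that concrete, this is roughly what I do by hand today (a sketch; persist_tracked and unpersist_all_except are names I made up, not anything in Spark):

_persisted = []  # every DataFrame I have persisted so far

def persist_tracked(df):
    df.persist()
    _persisted.append(df)
    return df

def unpersist_all_except(keep):
    keep_ids = {id(df) for df in keep}
    for df in _persisted:
        if id(df) not in keep_ids:
            df.unpersist()
    _persisted[:] = [df for df in _persisted if id(df) in keep_ids]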

An alternative or complementary approach, perhaps, would be to allow
persistence (and perhaps even checkpointing) to be explicitly scoped
<https://issues.apache.org/jira/browse/SPARK-16921>. I think in some
circles this is called “Scope-based Resource Management” or “Resource
acquisition is initialization” (RAII). It would make it a lot easier to
track and release DataFrames or RDDs when they are no longer needed in
cache.
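
As a sketch of what the scoped version might feel like in PySpark (nothing like this exists today; the persisted() helper below is made up):

from contextlib import contextmanager

@contextmanager
def persisted(df):
    df.persist()
    try:
        yield df
    finally:
        df.unpersist()

# with persisted(expensive_df) as cached:
#     cached.count()                      # cached within the block
#     cached.join(other_df, "id").count()
# # automatically unpersisted here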

Nick

On Tue, May 8, 2018 at 1:32 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

If I am understanding you correctly, you're just saying that the problem is
> that you know what you want to keep, not what you want to throw away, and
> that there is no unpersist DataFrames call based on that what-to-keep
> information.
>
> On Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I certainly can, but the problem I’m facing is that of how best to track
>> all the DataFrames I no longer want to persist.
>>
>> I create and persist various DataFrames throughout my pipeline. Spark is
>> already tracking all this for me, and exposing some of that tracking
>> information via getPersistentRDDs(). So when I arrive at a point in my
>> program where I know, “I only need this DataFrame going forward”, I want to
>> be able to tell Spark “Please unpersist everything except this one
>> DataFrame”. If I cannot leverage the information about persisted DataFrames
>> that Spark is already tracking, then the alternative is for me to carefully
>> track and unpersist DataFrames when I no longer need them.
>>
>> I suppose the problem is similar at a high level to garbage collection.
>> Tracking and freeing DataFrames manually is analogous to malloc and free;
>> and full automation would be Spark automatically unpersisting DataFrames
>> when they were no longer referenced or needed. I’m looking for an
>> in-between solution that lets me leverage some of the persistence tracking
>> in Spark so I don’t have to do it all myself.
>>
>> Does this make more sense now, from a use case perspective as well as
>> from a desired API perspective?
>> ​
>>
>> On Thu, May 3, 2018 at 10:26 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Why do you need the underlying RDDs? Can't you just unpersist the
>>> dataframes that you don't need?
>>>
>>>
>>> On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> This seems to be an underexposed part of the API. My use case is this:
>>>> I want to unpersist all DataFrames except a specific few. I want to do this
>>>> because I know at a specific point in my pipeline that I have a handful of
>>>> DataFrames that I need, and everything else is no longer needed.
>>>>
>>>> The problem is that there doesn’t appear to be a way to identify
>>>> specific DataFrames (or rather, their underlying RDDs) via
>>>> getPersistentRDDs(), which is the only way I’m aware of to ask Spark
>>>> for all currently persisted RDDs:
>>>>
>>>> >>> a = spark.range(10).persist()
>>>> >>> a.rdd.id()
>>>> 8
>>>> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
>>>> [(3, JavaObject id=o36)]
>>>>
>>>> As you can see, the id of the persisted RDD, 8, doesn’t match the id
>>>> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
>>>> returned by getPersistentRDDs() and know which ones I want to keep.
>>>>
>>>> id() itself appears to be an undocumented method of the RDD API, and
>>>> in PySpark getPersistentRDDs() is buried behind the Java sub-objects
>>>> <https://issues.apache.org/jira/browse/SPARK-2141>, so I know I’m
>>>> reaching here. But is there a way to do what I want in PySpark without
>>>> manually tracking everything I’ve persisted myself?
>>>>
>>>> And more broadly speaking, do we want to add additional APIs, or
>>>> formalize currently undocumented APIs like id(), to make this use case
>>>> possible?
>>>>
>>>> Nick
>>>> ​
>>>>
>>>
> ​


Re: eager execution and debuggability

2018-05-08 Thread Nicholas Chammas
This may be technically impractical, but it would be fantastic if we could
make it easier to debug Spark programs without needing to rely on eager
execution. Sprinkling .count() and .checkpoint() at various points in my
code is still a debugging technique I use, but it always makes me wish
Spark could point more directly to the offending transformation when
something goes wrong.

Is it somehow possible to have each individual operator (is that the
correct term?) in a DAG include metadata pointing back to the line of code
that generated the operator? That way when an action triggers an error, the
failing operation can point to the relevant line of code — even if it’s a
transformation — and not just the action on the tail end that triggered the
error.

I don’t know how feasible this is, but addressing it would directly solve
the issue of linking failures to the responsible transformation, as opposed
to leaving the user to break up a chain of transformations with several
debug actions. And this would benefit new and experienced users alike.

Nick

On Tue, May 8, 2018 at 7:09 PM Ryan Blue rb...@netflix.com.invalid wrote:

I've opened SPARK-24215 to track this.
>
> On Tue, May 8, 2018 at 3:58 PM, Reynold Xin  wrote:
>
>> Yup. Sounds great. This is something simple Spark can do and provide huge
>> value to the end users.
>>
>>
>> On Tue, May 8, 2018 at 3:53 PM Ryan Blue  wrote:
>>
>>> Would be great if it is something more turn-key.
>>>
>>> We can easily add the __repr__ and _repr_html_ methods and behavior to
>>> PySpark classes. We could also add a configuration property to determine
>>> whether the dataset evaluation is eager or not. That would make it turn-key
>>> for anyone running PySpark in Jupyter.
>>>
>>> For JVM languages, we could also add a dependency on jvm-repr and do the
>>> same thing.
>>>
>>> rb
>>> ​
>>>
>>> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin  wrote:
>>>
 s/underestimated/overestimated/

 On Tue, May 8, 2018 at 3:44 PM Reynold Xin  wrote:

> Marco,
>
> There is understanding how Spark works, and there is finding bugs
> early in their own program. One can perfectly understand how Spark works
> and still find it valuable to get feedback asap, and that's why we built
> eager analysis in the first place.
>
> Also I'm afraid you've significantly underestimated the level of
> technical sophistication of users. In many cases they struggle to get
> anything to work, and performance optimization of their programs is
> secondary to getting things working. As John Ousterhout says, "the 
> greatest
> performance improvement of all is when a system goes from not-working to
> working".
>
> I really like Ryan's approach. Would be great if it is something more
> turn-key.
>
>
>
>
>
>
> On Tue, May 8, 2018 at 2:35 PM Marco Gaido 
> wrote:
>
>> I am not sure how this is useful. For students, it is important to
>> understand how Spark works. This can be critical in many decision they 
>> have
>> to take (whether and what to cache for instance) in order to have
>> performant Spark application. Creating a eager execution probably can 
>> help
>> them having something running more easily, but let them also using Spark
>> knowing less about how it works, thus they are likely to write worse
>> application and to have more problems in debugging any kind of problem
>> which may later (in production) occur (therefore affecting their 
>> experience
>> with the tool).
>>
>> Moreover, as Ryan also mentioned, there are tools/ways to force the
>> execution, helping in the debugging phase. So they can achieve without a
>> big effort the same result, but with a big difference: they are aware of
>> what is really happening, which may help them later.
>>
>> Thanks,
>> Marco
>>
>> 2018-05-08 21:37 GMT+02:00 Ryan Blue :
>>
>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>> sessions. For anyone interested, this mode of interaction is really 
>>> easy to
>>> add in Jupyter and PySpark. You would just define a different
>>> *repr_html* or *repr* method for Dataset that runs a take(10) or
>>> take(100) and formats the result.
>>>
>>> That way, the output of a cell or console execution always causes
>>> the dataframe to run and get displayed for that immediate feedback. But,
>>> there is no change to Spark’s behavior because the action is run by the
>>> REPL, and only when a dataframe is a result of an execution in order to
>>> display it. Intermediate results wouldn’t be run, but that gives users a
>>> way to avoid too many executions and would still support method 
>>> 
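
(For reference, a rough sketch of the monkey-patch Ryan is describing above, using nothing beyond take() and the column list; this is not something Spark ships, just an illustration:)

from pyspark.sql import DataFrame

def _df_repr_html_(self, n=10):
    rows = self.take(n)  # runs the plan eagerly so the notebook can display it
    header = "".join("<th>%s</th>" % c for c in self.columns)
    body = "".join(
        "<tr>" + "".join("<td>%s</td>" % (v,) for v in row) + "</tr>"
        for row in rows
    )
    return "<table><tr>%s</tr>%s</table>" % (header, body)

DataFrame._repr_html_ = _df_repr_html_  # Jupyter picks this up automatically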

[jira] [Comment Edited] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468045#comment-16468045
 ] 

Nicholas Chammas edited comment on SPARK-23945 at 5/8/18 10:22 PM:
---

{quote}So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa
{quote}
Since writing this, I realized that the DataFrame API is able to express `IN` 
and `NOT IN` via an inner join and left anti join respectively. And I suspect 
most other cases where I may have thought the DataFrame API is less powerful 
than SQL are incorrect. The various DataFrame join types basically cover a lot 
of the stuff you'd want to do with subqueries.

So I'd actually be fine with closing this out as "Won't Fix" and instructing 
users, in the particular example I provided above, to express their query as 
follows:
{code:java}
(table1
.join(
table2,
on='name',
how='left_anti',
)
){code}
This is equivalent to the SQL query I posted, and does not require that 
anything be collected locally, so it scales just as well.
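
For example (a small self-contained sketch with made-up data):
{code:java}
table1 = spark.createDataFrame([('alice',), ('bob',), ('carol',)], ['name'])
table2 = spark.createDataFrame([('bob',)], ['name'])

# Equivalent to: SELECT * FROM table1 WHERE name NOT IN (SELECT name FROM table2)
table1.join(table2, on='name', how='left_anti').show()
# Returns the rows for 'alice' and 'carol' (order may vary).
{code}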

[~hvanhovell] - Does this make sense?


was (Author: nchammas):
> So in the grand scheme of things I'd expect DataFrames to be able to do 
> everything that SQL can and vice versa

 

Since writing this, I realized that the DataFrame API is able to express `IN` 
and `NOT IN` via an inner join and left anti join respectively. And I suspect 
most other cases where I may have thought the DataFrame API is less powerful 
than SQL are incorrect. The various DataFrame join types basically cover a lot 
of the stuff you'd want to do with subqueries.

So I'd actually be fine with closing this out as "Won't Fix" and instructing 
users, in the particular example I provided above, to express their query as 
follows:
{code:java}
(table1
.join(
table2,
on='name',
how='left_anti',
)
){code}
This is equivalent to the SQL query I posted, and does not require that 
anything be collected locally, so it scales just as well.

[~hvanhovell] - Does this make sense?

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468045#comment-16468045
 ] 

Nicholas Chammas commented on SPARK-23945:
--

> So in the grand scheme of things I'd expect DataFrames to be able to do 
> everything that SQL can and vice versa

 

Since writing this, I realized that the DataFrame API is able to express `IN` 
and `NOT IN` via an inner join and left anti join respectively. And I suspect 
most other cases where I may have thought the DataFrame API is less powerful 
than SQL are incorrect. The various DataFrame join types basically cover a lot 
of the stuff you'd want to do with subqueries.

So I'd actually be fine with closing this out as "Won't Fix" and instructing 
users, in the particular example I provided above, to express their query as 
follows:
{code:java}
(table1
.join(
table2,
on='name',
how='left_anti',
)
){code}
This is equivalent to the SQL query I posted, and does not require that 
anything be collected locally, so it scales just as well.

[~hvanhovell] - Does this make sense?

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433316#comment-16433316
 ] 

Nicholas Chammas edited comment on SPARK-23945 at 5/8/18 10:13 PM:
---

I always looked at DataFrames and SQL as two different interfaces to the same 
underlying logical model, so I just assumed that the vision was for them to be 
equally powerful. Is that not the case?

So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa, but for the narrow purposes of this 
ticket I'm just interested in {{IN and NOT IN.}}


was (Author: nchammas):
I always looked at DataFrames and SQL as two different interfaces to the same 
underlying logical model, so I just assumed that the vision was for them to be 
equally powerful. Is that not the case?

So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa, but for the narrow purposes of this 
ticket I'm just interested in {{IN }}and {{NOT IN.}}

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Documenting the various DataFrame/SQL join types

2018-05-08 Thread Nicholas Chammas
The documentation for DataFrame.join()

lists all the join types we support:

   - inner
   - cross
   - outer
   - full
   - full_outer
   - left
   - left_outer
   - right
   - right_outer
   - left_semi
   - left_anti

Some of these join types are also listed on the SQL Programming Guide

.

Is it obvious to everyone what all these different join types are? For
example, I had never heard of a LEFT ANTI join until stumbling on it in the
PySpark docs. It’s quite handy! But I had to experiment with it a bit just
to understand what it does.
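
For instance, a couple of lines like these (made-up data, just a sketch) would have saved me the experimentation:

people = spark.createDataFrame([("alice",), ("bob",), ("carol",)], ["name"])
banned = spark.createDataFrame([("bob",)], ["name"])

people.join(banned, on="name", how="left_semi").show()  # keeps only "bob"
people.join(banned, on="name", how="left_anti").show()  # keeps "alice" and "carol"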

I think it would be a good service to our users if we either documented
these join types ourselves clearly, or provided a link to an external
resource that documented them sufficiently. I’m happy to file a JIRA about
this and do the work itself. It would be great if the documentation could
be expressed as a series of simple doc tests, but brief prose describing
how each join works would still be valuable.

Does this seem worthwhile to folks here? And does anyone want to offer
guidance on how best to provide this kind of documentation so that it’s
easy to find by users, regardless of the language they’re using?

Nick
​


Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Nicholas Chammas
I certainly can, but the problem I’m facing is that of how best to track
all the DataFrames I no longer want to persist.

I create and persist various DataFrames throughout my pipeline. Spark is
already tracking all this for me, and exposing some of that tracking
information via getPersistentRDDs(). So when I arrive at a point in my
program where I know, “I only need this DataFrame going forward”, I want to
be able to tell Spark “Please unpersist everything except this one
DataFrame”. If I cannot leverage the information about persisted DataFrames
that Spark is already tracking, then the alternative is for me to carefully
track and unpersist DataFrames when I no longer need them.

I suppose the problem is similar at a high level to garbage collection.
Tracking and freeing DataFrames manually is analogous to malloc and free;
and full automation would be Spark automatically unpersisting DataFrames
when they were no longer referenced or needed. I’m looking for an
in-between solution that lets me leverage some of the persistence tracking
in Spark so I don’t have to do it all myself.

Does this make more sense now, from a use case perspective as well as from
a desired API perspective?
​

On Thu, May 3, 2018 at 10:26 PM Reynold Xin <r...@databricks.com> wrote:

> Why do you need the underlying RDDs? Can't you just unpersist the
> dataframes that you don't need?
>
>
> On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This seems to be an underexposed part of the API. My use case is this: I
>> want to unpersist all DataFrames except a specific few. I want to do this
>> because I know at a specific point in my pipeline that I have a handful of
>> DataFrames that I need, and everything else is no longer needed.
>>
>> The problem is that there doesn’t appear to be a way to identify specific
>> DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
>> which is the only way I’m aware of to ask Spark for all currently persisted
>> RDDs:
>>
>> >>> a = spark.range(10).persist()
>> >>> a.rdd.id()
>> 8
>> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
>> [(3, JavaObject id=o36)]
>>
>> As you can see, the id of the persisted RDD, 8, doesn’t match the id
>> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
>> returned by getPersistentRDDs() and know which ones I want to keep.
>>
>> id() itself appears to be an undocumented method of the RDD API, and in
>> PySpark getPersistentRDDs() is buried behind the Java sub-objects
>> <https://issues.apache.org/jira/browse/SPARK-2141>, so I know I’m
>> reaching here. But is there a way to do what I want in PySpark without
>> manually tracking everything I’ve persisted myself?
>>
>> And more broadly speaking, do we want to add additional APIs, or
>> formalize currently undocumented APIs like id(), to make this use case
>> possible?
>>
>> Nick
>> ​
>>
>


Identifying specific persisted DataFrames via getPersistentRDDs()

2018-04-30 Thread Nicholas Chammas
This seems to be an underexposed part of the API. My use case is this: I
want to unpersist all DataFrames except a specific few. I want to do this
because I know at a specific point in my pipeline that I have a handful of
DataFrames that I need, and everything else is no longer needed.

The problem is that there doesn’t appear to be a way to identify specific
DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
which is the only way I’m aware of to ask Spark for all currently persisted
RDDs:

>>> a = spark.range(10).persist()
>>> a.rdd.id()
8
>>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
[(3, JavaObject id=o36)]

As you can see, the id of the persisted RDD, 8, doesn’t match the id
returned by getPersistentRDDs(), 3. So I can’t go through the RDDs returned
by getPersistentRDDs() and know which ones I want to keep.

id() itself appears to be an undocumented method of the RDD API, and in
PySpark getPersistentRDDs() is buried behind the Java sub-objects
, so I know I’m reaching
here. But is there a way to do what I want in PySpark without manually
tracking everything I’ve persisted myself?

And more broadly speaking, do we want to add additional APIs, or formalize
currently undocumented APIs like id(), to make this use case possible?

Nick
​


Re: Correlated subqueries in the DataFrame API

2018-04-27 Thread Nicholas Chammas
What about exposing transforms that make it easy to coerce data to what the
method needs? Instead of passing a dataframe, you’d pass df.toSet to isin

Assuming toSet returns a local list, wouldn’t that have the problem of not
being able to handle extremely large lists? In contrast, I believe SQL’s IN
is implemented in such a way that the inner query being referenced by IN
does not need to be collected locally. Did I understand your suggestion
correctly?

I think having .isin() accept a Column potentially makes more sense, since
that matches what happens in SQL in terms of semantics, and would hopefully
also preserve the distributed nature of the operation.

For example, I believe in most cases we’d want this

(table1
.where(
table1['name'].isin(
table2.select('name')
# table2['name']  # per Reynold's suggestion
)))

and this

(table1
.join(table2, on='name')
.select(table1['*']))

to compile down to the same physical plan. No?
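
(One way to check at least the join side of this, as a sketch; table1 and table2 here stand in for any two DataFrames sharing a "name" column:)

table1 = spark.createDataFrame([("a",), ("b",), ("c",)], ["name"])
table2 = spark.createDataFrame([("b",)], ["name"])

# The semi-join formulation of IN; .explain() prints the physical plan Spark chooses.
table1.join(table2, on="name", how="left_semi").explain()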

Nick
​

On Thu, Apr 19, 2018 at 7:13 PM Reynold Xin <r...@databricks.com> wrote:

> Perhaps we can just have a function that turns a DataFrame into a Column?
> That'd work for both correlated and uncorrelated case, although in the
> correlated case we'd need to turn off eager analysis (otherwise there is no
> way to construct a valid DataFrame).
>
>
> On Thu, Apr 19, 2018 at 4:08 PM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Nick, thanks for raising this.
>>
>> It looks useful to have something in the DF API that behaves like
>> sub-queries, but I’m not sure that passing a DF works. Making every method
>> accept a DF that may contain matching data seems like it puts a lot of work
>> on the API — which now has to accept a DF all over the place.
>>
>> What about exposing transforms that make it easy to coerce data to what
>> the method needs? Instead of passing a dataframe, you’d pass df.toSet to
>> isin:
>>
>> val subQ = spark.sql("select distinct filter_col from source")
>> val df = table.filter($"col".isin(subQ.toSet))
>>
>> That also distinguishes between a sub-query and a correlated sub-query
>> that uses values from the outer query. We would still need to come up with
>> syntax for the correlated case, unless there’s a proposal already.
>>
>> rb
>> ​
>>
>> On Mon, Apr 9, 2018 at 3:56 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I just submitted SPARK-23945
>>> <https://issues.apache.org/jira/browse/SPARK-23945> but wanted to
>>> double check here to make sure I didn't miss something fundamental.
>>>
>>> Correlated subqueries are tracked at a high level in SPARK-18455
>>> <https://issues.apache.org/jira/browse/SPARK-18455>, but it's not clear
>>> to me whether they are "design-appropriate" for the DataFrame API.
>>>
>>> Are correlated subqueries a thing we can expect to have in the DataFrame
>>> API?
>>>
>>> Nick
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433316#comment-16433316
 ] 

Nicholas Chammas commented on SPARK-23945:
--

I always looked at DataFrames and SQL as two different interfaces to the same 
underlying logical model, so I just assumed that the vision was for them to be 
equally powerful. Is that not the case?

So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa, but for the narrow purposes of this 
ticket I'm just interested in {{IN }}and {{NOT IN.}}

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Correlated subqueries in the DataFrame API

2018-04-09 Thread Nicholas Chammas
I just submitted SPARK-23945
<https://issues.apache.org/jira/browse/SPARK-23945> but wanted to double
check here to make sure I didn't miss something fundamental.

Correlated subqueries are tracked at a high level in SPARK-18455
<https://issues.apache.org/jira/browse/SPARK-18455>, but it's not clear to
me whether they are "design-appropriate" for the DataFrame API.

Are correlated subqueries a thing we can expect to have in the DataFrame
API?

Nick


[jira] [Updated] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-09 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-23945:
-
Description: 
In SQL you can filter rows based on the result of a subquery:
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
.where(
~col('name').isin(
table2.select('name')
)
)
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 

  was:
In SQL you can filter rows based on the result of a subquery:

 
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
.where(
~col('name').isin(
table2.select('name')
)
)
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 


> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-09 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-23945:


 Summary: Column.isin() should accept a single-column DataFrame as 
input
 Key: SPARK-23945
 URL: https://issues.apache.org/jira/browse/SPARK-23945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Nicholas Chammas


In SQL you can filter rows based on the result of a subquery:

 
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
.where(
~col('name').isin(
table2.select('name')
)
)
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414190#comment-16414190
 ] 

Nicholas Chammas commented on SPARK-22513:
--

Thanks for the breakdown. This will be handy for reference. So I guess at the 
summary level Sean was correct. :D

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Priority: Major
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Changing how we compute release hashes

2018-03-23 Thread Nicholas Chammas
To close the loop here: SPARK-23716
<https://issues.apache.org/jira/browse/SPARK-23716>

On Fri, Mar 16, 2018 at 5:00 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> OK, will do.
>
> On Fri, Mar 16, 2018 at 4:41 PM Sean Owen <sro...@gmail.com> wrote:
>
>> I think you can file a JIRA and open a PR. All of the bits that use "gpg
>> ... SHA512 file ..." can use shasum instead.
>> I would not change any existing release artifacts though.
>>
>> On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I have sha512sum on my Mac via Homebrew, but yeah as long as the format
>>> is the same I suppose it doesn’t matter if we use shasum -a or sha512sum
>>> .
>>>
>>> So shall I file a JIRA + PR for this? Or should I leave the PR to a
>>> maintainer? And are we OK with updating all the existing release hashes to
>>> use the new format, or do we only want to do this for new releases?
>>> ​
>>>
>>> On Fri, Mar 16, 2018 at 1:50 PM Felix Cheung <felixcheun...@hotmail.com>
>>> wrote:
>>>
>>>> +1 there
>>>>
>>>> --
>>>> *From:* Sean Owen <sro...@gmail.com>
>>>> *Sent:* Friday, March 16, 2018 9:51:49 AM
>>>> *To:* Felix Cheung
>>>> *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list
>>>>
>>>> *Subject:* Re: Changing how we compute release hashes
>>>> I think the issue with that is that OS X doesn't have "sha512sum". Both
>>>> it and Linux have "shasum -a 512" though.
>>>>
>>>> On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung <
>>>> felixcheun...@hotmail.com> wrote:
>>>>
>>>>> Instead of using gpg to create the sha512 hash file we could just
>>>>> change to using sha512sum? That would output the right format that is in
>>>>> turns verifiable.
>>>>>
>>>>>
>>>>> --
>>>>> *From:* Ryan Blue <rb...@netflix.com.INVALID>
>>>>> *Sent:* Friday, March 16, 2018 8:31:45 AM
>>>>> *To:* Nicholas Chammas
>>>>> *Cc:* Spark dev list
>>>>> *Subject:* Re: Changing how we compute release hashes
>>>>>
>>>>> +1 It's possible to produce the same file with gpg, but the sha*sum
>>>>> utilities are a bit easier to remember the syntax for.
>>>>>
>>>>> On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> To verify that I’ve downloaded a Hadoop release correctly, I can just
>>>>>> do this:
>>>>>>
>>>>>> $ shasum --check hadoop-2.7.5.tar.gz.sha256
>>>>>> hadoop-2.7.5.tar.gz: OK
>>>>>>
>>>>>> However, since we generate Spark release hashes with GPG
>>>>>> <https://github.com/apache/spark/blob/c2632edebd978716dbfa7874a2fc0a8f5a4a9951/dev/create-release/release-build.sh#L167-L168>,
>>>>>> the resulting hash is in a format that doesn’t play well with any tools:
>>>>>>
>>>>>> $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
>>>>>> shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
>>>>>> checksum lines found
>>>>>>
>>>>>> GPG doesn’t seem to offer a way to verify a file from a hash.
>>>>>>
>>>>>> I know I can always manipulate the SHA512 hash into a different
>>>>>> format or just manually inspect it, but as a “quality of life” 
>>>>>> improvement
>>>>>> can we change how we generate the SHA512 hash so that it plays nicely 
>>>>>> with
>>>>>> shasum? If it’s too disruptive to change the format of the SHA512
>>>>>> hash, can we add a SHA256 hash to our releases in this format?
>>>>>>
>>>>>> I suppose if it’s not easy to update or add hashes to our existing
>>>>>> releases, it may be too difficult to change anything here. But I’m not
>>>>>> sure, so I thought I’d ask.
>>>>>>
>>>>>> Nick
>>>>>> ​
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>


[jira] [Comment Edited] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412423#comment-16412423
 ] 

Nicholas Chammas edited comment on SPARK-23716 at 3/24/18 5:13 AM:
---

For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies [my need|https://github.com/nchammas/flintrock/issues/238], 
which is focused on syncing Spark releases from the Apache archive to an S3 
bucket.

Closing this as "Won't Fix".


was (Author: nchammas):
For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies my need. I am syncing Spark releases from the Apache 
distribution archive to a personal S3 bucket and need a way to verify the 
integrity of the files.

> Change SHA512 style in release artifacts to play nicely with shasum utility
> ---
>
> Key: SPARK-23716
> URL: https://issues.apache.org/jira/browse/SPARK-23716
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As [discussed 
> here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-23716.
--
Resolution: Won't Fix

For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies my need. I am syncing Spark releases from the Apache 
distribution archive to a personal S3 bucket and need a way to verify the 
integrity of the files.
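
A rough sketch of that kind of conversion, assuming the GPG-style input
follows the gpg --print-md SHA512 layout (filename, a colon, then the
uppercase digest in whitespace-separated groups, possibly wrapped across
lines); this is illustrative and not the exact script referenced above:

import re

def gpg_to_shasum(gpg_text):
    # Collapse wrapped lines, then split "<filename>: <digest groups>".
    flattened = " ".join(line.strip() for line in gpg_text.splitlines())
    filename, _, digest_part = flattened.partition(":")
    # Drop the spacing between digest groups and lowercase the hex so the
    # result matches the "<digest>  <filename>" lines shasum --check expects.
    digest = re.sub(r"\s+", "", digest_part).lower()
    return "{}  {}".format(digest, filename.strip())

# Hypothetical example resembling a GPG-style .sha512 file.
example = ("spark-2.3.0-bin-hadoop2.7.tgz: 12345678 9ABCDEF0 12345678 9ABCDEF0\n"
           "                               12345678 9ABCDEF0 12345678 9ABCDEF0")
print(gpg_to_shasum(example))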

> Change SHA512 style in release artifacts to play nicely with shasum utility
> ---
>
> Key: SPARK-23716
> URL: https://issues.apache.org/jira/browse/SPARK-23716
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> As [discussed 
> here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412218#comment-16412218
 ] 

Nicholas Chammas commented on SPARK-22513:
--

Fair enough.

Just as an alternate confirmation, [~ste...@apache.org] can you comment on 
whether there might be any issues running Spark built against Hadoop 2.7 with, 
say, HDFS 2.8?

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Priority: Major
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405615#comment-16405615
 ] 

Nicholas Chammas commented on SPARK-23534:
--

I don't know what it takes to add a Hadoop 3.0 build profile to Spark, and I 
also don't know anyone who can help with this.

However, I don't see the urgency in getting this done. Who, really, is already 
running Hadoop 3.0 in production, or even getting close to that? And what 
vendors are already shipping Hadoop 3.0? Cloudera still ships 2.6 and EMR is on 
2.7.

Hadoop releases have historically been very disruptive, with 
backwards-incompatible changes even in minor releases. If that's still the 
case, I doubt many people will be running Hadoop 3.0 anytime soon. But I 
confess my stance here is based on my experience a couple of years ago and may 
no longer be relevant.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405563#comment-16405563
 ] 

Nicholas Chammas commented on SPARK-23534:
--

I believe this ticket is a duplicate of SPARK-23151, though given the activity 
here perhaps we should close that one in favor of this ticket here.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405561#comment-16405561
 ] 

Nicholas Chammas commented on SPARK-22513:
--

[~srowen] - Just curious: How do you know that Spark built against Hadoop 2.7 
should work with Hadoop 2.8? Is there a reference for that somewhere?

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Priority: Major
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-16 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-23716:


 Summary: Change SHA512 style in release artifacts to play nicely 
with shasum utility
 Key: SPARK-23716
 URL: https://issues.apache.org/jira/browse/SPARK-23716
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.3.0
Reporter: Nicholas Chammas


As [discussed 
here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
OK, will do.

On Fri, Mar 16, 2018 at 4:41 PM Sean Owen <sro...@gmail.com> wrote:

> I think you can file a JIRA and open a PR. All of the bits that use "gpg
> ... SHA512 file ..." can use shasum instead.
> I would not change any existing release artifacts though.
>
> On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I have sha512sum on my Mac via Homebrew, but yeah as long as the format
>> is the same I suppose it doesn’t matter if we use shasum -a or sha512sum.
>>
>> So shall I file a JIRA + PR for this? Or should I leave the PR to a
>> maintainer? And are we OK with updating all the existing release hashes to
>> use the new format, or do we only want to do this for new releases?
>> ​
>>
>> On Fri, Mar 16, 2018 at 1:50 PM Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> +1 there
>>>
>>> --
>>> *From:* Sean Owen <sro...@gmail.com>
>>> *Sent:* Friday, March 16, 2018 9:51:49 AM
>>> *To:* Felix Cheung
>>> *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list
>>>
>>> *Subject:* Re: Changing how we compute release hashes
>>> I think the issue with that is that OS X doesn't have "sha512sum". Both
>>> it and Linux have "shasum -a 512" though.
>>>
>>> On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung <felixcheun...@hotmail.com>
>>> wrote:
>>>
>>>> Instead of using gpg to create the sha512 hash file we could just
>>>> change to using sha512sum? That would output the right format that is in
>>>> turns verifiable.
>>>>
>>>>
>>>> --
>>>> *From:* Ryan Blue <rb...@netflix.com.INVALID>
>>>> *Sent:* Friday, March 16, 2018 8:31:45 AM
>>>> *To:* Nicholas Chammas
>>>> *Cc:* Spark dev list
>>>> *Subject:* Re: Changing how we compute release hashes
>>>>
>>>> +1 It's possible to produce the same file with gpg, but the sha*sum
>>>> utilities are a bit easier to remember the syntax for.
>>>>
>>>> On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> To verify that I’ve downloaded a Hadoop release correctly, I can just
>>>>> do this:
>>>>>
>>>>> $ shasum --check hadoop-2.7.5.tar.gz.sha256
>>>>> hadoop-2.7.5.tar.gz: OK
>>>>>
>>>>> However, since we generate Spark release hashes with GPG
>>>>> <https://github.com/apache/spark/blob/c2632edebd978716dbfa7874a2fc0a8f5a4a9951/dev/create-release/release-build.sh#L167-L168>,
>>>>> the resulting hash is in a format that doesn’t play well with any tools:
>>>>>
>>>>> $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
>>>>> shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
>>>>> checksum lines found
>>>>>
>>>>> GPG doesn’t seem to offer a way to verify a file from a hash.
>>>>>
>>>>> I know I can always manipulate the SHA512 hash into a different format
>>>>> or just manually inspect it, but as a “quality of life” improvement can we
>>>>> change how we generate the SHA512 hash so that it plays nicely with
>>>>> shasum? If it’s too disruptive to change the format of the SHA512
>>>>> hash, can we add a SHA256 hash to our releases in this format?
>>>>>
>>>>> I suppose if it’s not easy to update or add hashes to our existing
>>>>> releases, it may be too difficult to change anything here. But I’m not
>>>>> sure, so I thought I’d ask.
>>>>>
>>>>> Nick
>>>>> ​
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>


Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
I have sha512sum on my Mac via Homebrew, but yeah as long as the format is
the same I suppose it doesn’t matter if we use shasum -a or sha512sum.

So shall I file a JIRA + PR for this? Or should I leave the PR to a
maintainer? And are we OK with updating all the existing release hashes to
use the new format, or do we only want to do this for new releases?
​

On Fri, Mar 16, 2018 at 1:50 PM Felix Cheung <felixcheun...@hotmail.com>
wrote:

> +1 there
>
> --
> *From:* Sean Owen <sro...@gmail.com>
> *Sent:* Friday, March 16, 2018 9:51:49 AM
> *To:* Felix Cheung
> *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list
>
> *Subject:* Re: Changing how we compute release hashes
> I think the issue with that is that OS X doesn't have "sha512sum". Both it
> and Linux have "shasum -a 512" though.
>
> On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Instead of using gpg to create the sha512 hash file we could just change
>> to using sha512sum? That would output the right format that is in turns
>> verifiable.
>>
>>
>> ------
>> *From:* Ryan Blue <rb...@netflix.com.INVALID>
>> *Sent:* Friday, March 16, 2018 8:31:45 AM
>> *To:* Nicholas Chammas
>> *Cc:* Spark dev list
>> *Subject:* Re: Changing how we compute release hashes
>>
>> +1 It's possible to produce the same file with gpg, but the sha*sum
>> utilities are a bit easier to remember the syntax for.
>>
>> On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> To verify that I’ve downloaded a Hadoop release correctly, I can just do
>>> this:
>>>
>>> $ shasum --check hadoop-2.7.5.tar.gz.sha256
>>> hadoop-2.7.5.tar.gz: OK
>>>
>>> However, since we generate Spark release hashes with GPG
>>> <https://github.com/apache/spark/blob/c2632edebd978716dbfa7874a2fc0a8f5a4a9951/dev/create-release/release-build.sh#L167-L168>,
>>> the resulting hash is in a format that doesn’t play well with any tools:
>>>
>>> $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
>>> shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
>>> checksum lines found
>>>
>>> GPG doesn’t seem to offer a way to verify a file from a hash.
>>>
>>> I know I can always manipulate the SHA512 hash into a different format
>>> or just manually inspect it, but as a “quality of life” improvement can we
>>> change how we generate the SHA512 hash so that it plays nicely with
>>> shasum? If it’s too disruptive to change the format of the SHA512 hash,
>>> can we add a SHA256 hash to our releases in this format?
>>>
>>> I suppose if it’s not easy to update or add hashes to our existing
>>> releases, it may be too difficult to change anything here. But I’m not
>>> sure, so I thought I’d ask.
>>>
>>> Nick
>>> ​
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Changing how we compute release hashes

2018-03-15 Thread Nicholas Chammas
To verify that I’ve downloaded a Hadoop release correctly, I can just do
this:

$ shasum --check hadoop-2.7.5.tar.gz.sha256
hadoop-2.7.5.tar.gz: OK

However, since we generate Spark release hashes with GPG
<https://github.com/apache/spark/blob/c2632edebd978716dbfa7874a2fc0a8f5a4a9951/dev/create-release/release-build.sh#L167-L168>,
the resulting hash is in a format that doesn’t play well with any tools:

$ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted
SHA1 checksum lines found

GPG doesn’t seem to offer a way to verify a file from a hash.

I know I can always manipulate the SHA512 hash into a different format or
just manually inspect it, but as a “quality of life” improvement can we
change how we generate the SHA512 hash so that it plays nicely with shasum?
If it’s too disruptive to change the format of the SHA512 hash, can we add
a SHA256 hash to our releases in this format?

I suppose if it’s not easy to update or add hashes to our existing
releases, it may be too difficult to change anything here. But I’m not
sure, so I thought I’d ask.
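
For reference, the format shasum --check consumes is a line with the hex
digest, two spaces, and the file name; a small sketch that writes such a line
with Python's hashlib (the file name is a hypothetical example, and the whole
file is read into memory for brevity):

import hashlib
from pathlib import Path

def write_shasum_style_sha512(path_str):
    path = Path(path_str)
    digest = hashlib.sha512(path.read_bytes()).hexdigest()
    line = "{}  {}\n".format(digest, path.name)
    # Writes e.g. spark-2.3.0-bin-hadoop2.7.tgz.sha512 next to the artifact.
    Path(path_str + ".sha512").write_text(line)
    return line

print(write_shasum_style_sha512("spark-2.3.0-bin-hadoop2.7.tgz"))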

Nick
​


Re: Silencing messages from Ivy when calling spark-submit

2018-03-12 Thread Nicholas Chammas
Looks like the only way to change the Ivy log setting, given how we're
using it in Spark, is via code
<https://stackoverflow.com/questions/49000342/suppress-messages-from-spark-submit-when-loading-packages/49188333#49188333>.
This makes me think: People often have trouble
<https://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console>
(77K
views) silencing undesired output from Spark.

Would it make sense to have a convenience option in spark-submit that lets
you set a global log level for Ivy, log4j, and anything else? i.e. Without
needing to edit any separate files or look at other projects' documentation.

It sure would be handy for everyday use, but I wonder if there is a design
reason we wouldn't want to add something like that.
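
For context, the ivySettings route suggested below in this thread amounts to
roughly this sketch when set programmatically (the settings path and package
coordinates are hypothetical placeholders, and per the discussion above the
Ivy log level itself still seems to be controllable only from code, not from
the settings file):

from pyspark.sql import SparkSession

# These confs must be set before the SparkSession (and its JVM) is created,
# since package resolution happens at launch time.
spark = (
    SparkSession.builder
    .appName("ivy-settings-example")
    .config("spark.jars.ivySettings", "/path/to/ivysettings.xml")
    .config("spark.jars.packages", "com.example:some-library:1.0.0")
    .getOrCreate()
)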

On Tue, Mar 6, 2018 at 1:14 PM Bryan Cutler <cutl...@gmail.com> wrote:

> Cool, hopefully it will work.  I don't know what setting that would be
> though, but it seems like it might be somewhere under here
> http://ant.apache.org/ivy/history/latest-milestone/settings/outputters.html.
> It's pretty difficult to sort through the docs, and I often found myself
> looking at the source to understand some settings.  If you happen to figure
> out the answer, please report back here.  I'm sure others would find it
> useful too.
>
> Bryan
>
> On Mon, Mar 5, 2018 at 3:50 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Oh, I didn't know about that. I think that will do the trick.
>>
>> Would you happen to know what setting I need? I'm looking here
>> <http://ant.apache.org/ivy/history/latest-milestone/settings.html>, but
>> it's a bit overwhelming. I'm basically looking for a way to set the overall
>> Ivy log level to WARN or higher.
>>
>> Nick
>>
>> On Mon, Mar 5, 2018 at 2:11 PM Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> Hi Nick,
>>>
>>> Not sure about changing the default to warnings only because I think
>>> some might find the resolution output useful, but you can specify your own
>>> ivy settings file with "spark.jars.ivySettings" to point to your
>>> ivysettings.xml file.  Would that work for you to configure it there?
>>>
>>> Bryan
>>>
>>> On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> I couldn’t get an answer anywhere else, so I thought I’d ask here.
>>>>
>>>> Is there a way to silence the messages that come from Ivy when you call
>>>> spark-submit with --packages? (For the record, I asked this question on
>>>> Stack Overflow <https://stackoverflow.com/q/49000342/877069>.)
>>>>
>>>> Would it be a good idea to configure Ivy by default to only output
>>>> warnings or errors?
>>>>
>>>> Nick
>>>> ​
>>>>
>>>
>>>
>


[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2018-03-07 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389723#comment-16389723
 ] 

Nicholas Chammas commented on SPARK-18492:
--

[~imranshaik] - This is an open source project. You cannot demand that anyone 
"solve this asap". People are contributing their free time or time donated by 
the companies that employ them.

If you want someone to fix this for you "asap", perhaps you should look at paid 
support from Databricks, Cloudera, Hortonworks, Amazon, or some other big 
company that works with Spark.

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>Priority: Major
> Attachments: Screenshot from 2018-03-02 12-57-51.png
>
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */

Re: Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
Oh, I didn't know about that. I think that will do the trick.

Would you happen to know what setting I need? I'm looking here
<http://ant.apache.org/ivy/history/latest-milestone/settings.html>, but
it's a bit overwhelming. I'm basically looking for a way to set the overall
Ivy log level to WARN or higher.

Nick

On Mon, Mar 5, 2018 at 2:11 PM Bryan Cutler <cutl...@gmail.com> wrote:

> Hi Nick,
>
> Not sure about changing the default to warnings only because I think some
> might find the resolution output useful, but you can specify your own ivy
> settings file with "spark.jars.ivySettings" to point to your
> ivysettings.xml file.  Would that work for you to configure it there?
>
> Bryan
>
> On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I couldn’t get an answer anywhere else, so I thought I’d ask here.
>>
>> Is there a way to silence the messages that come from Ivy when you call
>> spark-submit with --packages? (For the record, I asked this question on
>> Stack Overflow <https://stackoverflow.com/q/49000342/877069>.)
>>
>> Would it be a good idea to configure Ivy by default to only output
>> warnings or errors?
>>
>> Nick
>> ​
>>
>
>


Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
I couldn’t get an answer anywhere else, so I thought I’d ask here.

Is there a way to silence the messages that come from Ivy when you call
spark-submit with --packages? (For the record, I asked this question on
Stack Overflow <https://stackoverflow.com/q/49000342/877069>.)

Would it be a good idea to configure Ivy by default to only output warnings
or errors?

Nick
​


[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2018-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383843#comment-16383843
 ] 

Nicholas Chammas commented on SPARK-18492:
--

Are you seeing the same on Spark 2.3.0? Apparently, a bunch of problems related 
to the 64KB limit were resolved. They may not have an impact on this error with 
GeneratedIterator, but it's good to check just in case.

From [http://spark.apache.org/releases/spark-release-2-3-0.html:]

*Performance and stability*
 * [SPARK-22510][SPARK-22692][SPARK-21871] Further stabilize the codegen 
framework to avoid hitting the {{64KB}} JVM bytecode limit on the Java method 
and Java compiler constant pool limit

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>Priority: Major
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);

Re: Please keep s3://spark-related-packages/ alive

2018-03-01 Thread Nicholas Chammas
Marton,

Thanks for the tip. (Too bad the docs
 referenced from the issue
I opened with INFRA  make
no mention of mirrors.cgi.)

Matei,

A Requester Pays bucket is a good idea. I was trying to avoid having to
maintain a repository of assets, but I suppose it's ultimately unavoidable
given that Apache does not partner with a CDN. I will look into this for
Flintrock.

Nick

On Wed, Feb 28, 2018 at 7:21 AM Marton, Elek  wrote:

>
> >  2. *Apache mirrors are inconvenient to use.* When you download
> > something from an Apache mirror, you get a link like this one
> > <
> https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
> >.
> > Instead of automatically redirecting you to your download, though,
> > you need to process the results you get back
> > <
> https://github.com/nchammas/flintrock/blob/67bf84a1b7cfa1c276cf57ecd8a0b27613ad2698/flintrock/scripts/download-hadoop.py#L21-L42
> >
> > to find your download target. And you need to handle the high
> > download failure rate, since sometimes the mirror you get doesn’t
> > have the file it claims to have.
>
> It's not a full answer, just a note:
>
> You can also use mirrors.cgi instead of parsing the json from closer.lua:
>
>
> https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download=spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
>
> (Unfortunatelly it doesn't check the availibility of the file. If it's
> moved to the archive you will be redirected to a 404)
>
> Marton
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Please keep s3://spark-related-packages/ alive

2018-02-27 Thread Nicholas Chammas
So is there no hope for this S3 bucket, or room to replace it with a bucket
owned by some organization other than AMPLab (which is technically now
defunct <https://amplab.cs.berkeley.edu/endofproject/>, I guess)? Sorry to
persist, but I just have to ask.

On Tue, Feb 27, 2018 at 10:36 AM Michael Heuer <heue...@gmail.com> wrote:

> On Tue, Feb 27, 2018 at 8:17 AM, Sean Owen <sro...@gmail.com> wrote:
>
>> See
>> http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-d3kbcqa49mib13-cloudfront-net-td22427.html
>>  --
>> it was 'retired', yes.
>>
>> Agree with all that, though they're intended for occasional individual
>> use and not a case where performance and uptime matter. For that, I think
>> you'd want to just host your own copy of the bits you need.
>>
>> The notional problem was that the S3 bucket wasn't obviously
>> controlled/blessed by the ASF and yet was a source of official bits. It was
>> another set of third-party credentials to hand around to release managers,
>> which was IIRC a little problematic.
>>
>> Homebrew does host distributions of ASF projects, like Spark, FWIW.
>>
>
> To clarify, the apache-spark.rb formula in Homebrew uses the Apache
> mirror closer.lua script
>
>
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb#L4
>
>michael
>
>
>
>> On Mon, Feb 26, 2018 at 10:57 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> If you go to the Downloads <http://spark.apache.org/downloads.html>
>>> page and download Spark 2.2.1, you’ll get a link to an Apache mirror. It
>>> didn’t use to be this way. As recently as Spark 2.2.0, downloads were
>>> served via CloudFront <https://aws.amazon.com/cloudfront/>, which was
>>> backed by an S3 bucket named spark-related-packages.
>>>
>>> It seems that we’ve stopped using CloudFront, and the S3 bucket behind
>>> it has stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m
>>> guessing this is part of an effort to use the Apache mirror network, like
>>> other Apache projects do.
>>>
>>> From a user perspective, the Apache mirror network is several steps down
>>> from using a modern CDN. Let me summarize why:
>>>
>>>1. *Apache mirrors are often slow.* Apache does not impose any
>>>performance requirements on its mirrors
>>>
>>> <https://issues.apache.org/jira/browse/INFRA-10999?focusedCommentId=15717950=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15717950>.
>>>The difference between getting a good mirror and a bad one means
>>>downloading Spark in less than a minute vs. 20 minutes. The problem is so
>>>bad that I’ve thought about adding an Apache mirror blacklist
>>><https://github.com/nchammas/flintrock/issues/84#issuecomment-185038678>
>>>to Flintrock to avoid getting one of these dud mirrors.
>>>2. *Apache mirrors are inconvenient to use.* When you download
>>>something from an Apache mirror, you get a link like this one
>>>
>>> <https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz>.
>>>Instead of automatically redirecting you to your download, though, you 
>>> need
>>>to process the results you get back
>>>
>>> <https://github.com/nchammas/flintrock/blob/67bf84a1b7cfa1c276cf57ecd8a0b27613ad2698/flintrock/scripts/download-hadoop.py#L21-L42>
>>>to find your download target. And you need to handle the high download
>>>failure rate, since sometimes the mirror you get doesn’t have the file it
>>>claims to have.
>>>3. *Apache mirrors are incomplete.* Apache mirrors only keep around
>>>the latest releases, save for a few “archive” mirrors, which are often
>>>slow. So if you want to download anything but the latest version of 
>>> Spark,
>>>you are out of luck.
>>>
>>> Some of these problems can be mitigated by picking a specific mirror
>>> that works well and hardcoding it in your scripts, but that defeats the
>>> purpose of dynamically selecting a mirror and makes you a “bad” user of the
>>> mirror network.
>>>
>>> I raised some of these issues over on INFRA-10999
>>> <https://issues.apache.org/jira/browse/INFRA-10999>. The ticket sat for
>>> a year before I heard anything back, and the bottom line was that none of
>>> the above problems have a solution on the horizon. It’s fine. I under

Suppressing output from Apache Ivy (?) when calling spark-submit with --packages

2018-02-27 Thread Nicholas Chammas
I’m not sure whether this is something controllable via Spark, but when you
call spark-submit with --packages you get a lot of output. Is there any way
to suppress it? Does it come from Apache Ivy?

I posted more details about what I’m seeing on Stack Overflow
<https://stackoverflow.com/q/49000342/877069>.

Nick


Please keep s3://spark-related-packages/ alive

2018-02-26 Thread Nicholas Chammas
If you go to the Downloads <http://spark.apache.org/downloads.html> page
and download Spark 2.2.1, you’ll get a link to an Apache mirror. It didn’t
use to be this way. As recently as Spark 2.2.0, downloads were served via
CloudFront <https://aws.amazon.com/cloudfront/>, which was backed by an S3
bucket named spark-related-packages.

It seems that we’ve stopped using CloudFront, and the S3 bucket behind it
has stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m guessing
this is part of an effort to use the Apache mirror network, like other
Apache projects do.

From a user perspective, the Apache mirror network is several steps down
from using a modern CDN. Let me summarize why:

   1. *Apache mirrors are often slow.* Apache does not impose any
   performance requirements on its mirrors
   <https://issues.apache.org/jira/browse/INFRA-10999?focusedCommentId=15717950&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15717950>.
   The difference between getting a good mirror and a bad one means
   downloading Spark in less than a minute vs. 20 minutes. The problem is so
   bad that I’ve thought about adding an Apache mirror blacklist
   <https://github.com/nchammas/flintrock/issues/84#issuecomment-185038678>
   to Flintrock to avoid getting one of these dud mirrors.
   2. *Apache mirrors are inconvenient to use.* When you download something
   from an Apache mirror, you get a link like this one
   <https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz>.
   Instead of automatically redirecting you to your download, though, you need
   to process the results you get back
   <https://github.com/nchammas/flintrock/blob/67bf84a1b7cfa1c276cf57ecd8a0b27613ad2698/flintrock/scripts/download-hadoop.py#L21-L42>
   to find your download target. And you need to handle the high download
   failure rate, since sometimes the mirror you get doesn’t have the file it
   claims to have. (A sketch of this dance appears right after this list.)
   3. *Apache mirrors are incomplete.* Apache mirrors only keep around the
   latest releases, save for a few “archive” mirrors, which are often slow. So
   if you want to download anything but the latest version of Spark, you are
   out of luck.
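
A stripped-down sketch of the results-processing dance from item 2, modeled
on the Flintrock helper linked there and assuming the as_json=1 response from
closer.lua carries 'preferred' and 'path_info' fields:

import json
from urllib.request import urlopen

def resolve_apache_mirror(path):
    # Ask closer.lua for a mirror, then assemble a direct download URL.
    api = "https://www.apache.org/dyn/closer.lua/{}?as_json=1".format(path)
    with urlopen(api) as response:
        mirrors = json.loads(response.read().decode("utf-8"))
    # The chosen mirror may still be missing the file; a robust script also
    # retries against other mirrors or the Apache archive.
    return mirrors["preferred"] + mirrors["path_info"]

print(resolve_apache_mirror("spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz"))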

Some of these problems can be mitigated by picking a specific mirror that
works well and hardcoding it in your scripts, but that defeats the purpose
of dynamically selecting a mirror and makes you a “bad” user of the mirror
network.

I raised some of these issues over on INFRA-10999
<https://issues.apache.org/jira/browse/INFRA-10999>. The ticket sat for a
year before I heard anything back, and the bottom line was that none of the
above problems have a solution on the horizon. It’s fine. I understand that
Apache is a volunteer organization and that the infrastructure team has a
lot to manage as it is. I still find it disappointing that an organization
of Apache’s stature doesn’t have a better solution for this in
collaboration with a third party. Python serves PyPI downloads using Fastly
 and Homebrew serves packages using Bintray
. They both work really, really well. Why don’t we
have something as good for Apache projects? Anyway, that’s a separate
discussion.

What I want to say is this:

Dear whoever owns the spark-related-packages S3 bucket
,

Please keep the bucket up-to-date with the latest Spark releases, alongside
the past releases that are already on there. It’s a huge help to the
Flintrock  project, and it’s an
equally big help to those of us writing infrastructure automation scripts
that deploy Spark in other contexts.

I understand that hosting this stuff is not free, and that I am not paying
anything for this service. If it needs to go, so be it. But I wanted to
take this opportunity to lay out the benefits I’ve enjoyed thanks to having
this bucket around, and to make sure that if it did die, it didn’t die a
quiet death.

Sincerely,
Nick
​


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Nicholas Chammas
Launched a test cluster on EC2 with Flintrock
 and ran some simple tests. Building
Spark took much longer than usual, but that may just be a fluke. Otherwise,
all looks good to me.

+1

On Fri, Feb 23, 2018 at 10:55 AM Denny Lee  wrote:

> +1 (non-binding)
>
> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
> joshgoldsboroughs...@gmail.com> wrote:
>
>> New to testing out Spark RCs for the community but I was able to run some
>> of the basic unit tests without error so for what it's worth, I'm a +1.
>>
>> On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.3.0. The vote is open until Tuesday February 27, 2018 at 8:00:00 am UTC
>>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.0-rc5:
>>> https://github.com/apache/spark/tree/v2.3.0-rc5
>>> (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>>>
>>> List of JIRA tickets resolved in this release can be found here:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1266/
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html
>>>
>>>
>>> FAQ
>>>
>>> ===
>>> What are the unresolved issues targeted for 2.3.0?
>>> ===
>>>
>>> Please see https://s.apache.org/oXKi. At the time of writing, there are
>>> currently no known release blockers.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala you
>>> can add the staging repository to your projects resolvers and test with the
>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>> up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.0?
>>> ===
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
>>> appropriate.
>>>
>>> ===
>>> Why is my bug not fixed?
>>> ===
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.2.0. That being said, if
>>> there is something which is a regression from 2.2.0 and has not been
>>> correctly targeted please ping me or a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.3.0 at
>>> https://s.apache.org/WmoI).
>>>
>>
>>


Re: Kubernetes: why use init containers?

2018-01-09 Thread Nicholas Chammas
I’d like to point out the output of “git show --stat” for that diff:
29 files changed, 130 insertions(+), 1560 deletions(-)

+1 for that and generally for the idea of leveraging spark-submit.

You can argue that executors downloading from
external servers would be faster than downloading from the driver, but
I’m not sure I’d agree - it can go both ways.

On a tangentially related note, one of the main reasons spark-ec2
 is so slow to launch clusters is that
it distributes files like the Spark binaries to all the workers via the
master. Because of that, the launch time scaled with the number of workers
requested .

When I wrote Flintrock , I got a
large improvement in launch time over spark-ec2 simply by having all the
workers download the installation files in parallel from an external host
(typically S3 or an Apache mirror). And launch time became largely
independent of the cluster size.

That may or may not say anything about the driver distributing application
files vs. having init containers do it in parallel, but I’d be curious to
hear more.

Nick
​

On Tue, Jan 9, 2018 at 9:08 PM Anirudh Ramanathan
 wrote:

> We were running a change in our fork which was similar to this at one
> point early on. My biggest concerns off the top of my head with this change
> would be localization performance with large numbers of executors, and what
> we lose in terms of separation of concerns. Init containers are a standard
> construct in k8s for resource localization. Also how this approach affects
> the HDFS work would be interesting.
>
> +matt +kimoon
> Still thinking about the potential trade offs here. Adding Matt and Kimoon
> who would remember more about our reasoning at the time.
>
>
> On Jan 9, 2018 5:22 PM, "Marcelo Vanzin"  wrote:
>
>> Hello,
>>
>> Me again. I was playing some more with the kubernetes backend and the
>> whole init container thing seemed unnecessary to me.
>>
>> Currently it's used to download remote jars and files, mount the
>> volume into the driver / executor, and place those jars in the
>> classpath / move the files to the working directory. This is all stuff
>> that spark-submit already does without needing extra help.
>>
>> So I spent some time hacking stuff and removing the init container
>> code, and launching the driver inside kubernetes using spark-submit
>> (similar to how standalone and mesos cluster mode works):
>>
>> https://github.com/vanzin/spark/commit/k8s-no-init
>>
>> I'd like to point out the output of "git show --stat" for that diff:
>>  29 files changed, 130 insertions(+), 1560 deletions(-)
>>
>> You get massive code reuse by simply using spark-submit. The remote
>> dependencies are downloaded in the driver, and the driver does the job
>> of serving them to executors.
>>
>> So I guess my question is: is there any advantage in using an init
>> container?
>>
>> The current init container code can download stuff in parallel, but
>> that's an easy improvement to make in spark-submit and that would
>> benefit everybody. You can argue that executors downloading from
>> external servers would be faster than downloading from the driver, but
>> I'm not sure I'd agree - it can go both ways.
>>
>> Also the same idea could probably be applied to starting executors;
>> Mesos starts executors using "spark-class" already, so doing that
>> would both improve code sharing and potentially simplify some code in
>> the k8s backend.
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[jira] [Commented] (ORC-152) Saving empty Spark DataFrame via ORC does not preserve schema

2017-12-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289678#comment-16289678
 ] 

Nicholas Chammas commented on ORC-152:
--

A link to the matching Spark issue is in the description above.

> Saving empty Spark DataFrame via ORC does not preserve schema
> -
>
> Key: ORC-152
> URL: https://issues.apache.org/jira/browse/ORC-152
> Project: ORC
>  Issue Type: Bug
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Details are on SPARK-15474.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217115#comment-16217115
 ] 

Nicholas Chammas commented on SPARK-13587:
--

To follow-up on my [earlier 
comment|https://issues.apache.org/jira/browse/SPARK-13587?focusedCommentId=15740419=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15740419],
 I created a [completely self-contained sample 
repo|https://github.com/massmutual/sample-pyspark-application] demonstrating a 
technique for bundling PySpark app dependencies in an isolated way. It's the 
technique that Ben, I, and several others discussed here in this JIRA issue.

https://github.com/massmutual/sample-pyspark-application

The approach has advantages (like letting you ship a completely isolated Python 
environment, so you don't even need Python installed on the workers) and 
disadvantages (requires YARN; increases job startup time). Hope some of you 
find the sample repo useful until Spark adds more "first-class" support for 
Python dependency isolation. A rough sketch of the general shape of the technique is below.
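
To give a rough sense of the moving pieces (this is an illustrative sketch, not the sample 
repo's actual scripts; the archive name and interpreter paths are hypothetical), the idea is to 
ship a packed Python environment to YARN and point the Python workers at its interpreter:

{code}
# Illustrative sketch only: "env.tar.gz" is a hypothetical archive of a full
# Python environment (e.g. built with conda), shipped to the YARN containers
# and unpacked under the alias "env".
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("isolated-env-demo")
    .config("spark.yarn.dist.archives", "env.tar.gz#env")
    .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "./env/bin/python")
    .config("spark.executorEnv.PYSPARK_PYTHON", "./env/bin/python")
    .getOrCreate()
)

# Imports inside UDFs now resolve against the shipped environment on the workers.
print(spark.range(5).count())
{code}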

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Disabling Closed -> Reopened transition for non-committers

2017-10-05 Thread Nicholas Chammas
Whoops, didn’t mean to send that out to the list. Apologies. Somehow, an
earlier draft of my email got sent out.

Nick


On Thu, Oct 5, 2017 at 9:20 AM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> The first sign that that conversation was going to go downhill was when
> the user [demanded](
> https://issues.apache.org/jira/browse/SPARK-21999?focusedCommentId=16166516=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16166516
> ):
>
> > Since we have ruled that out and given that [other conditions] could you
> have it tracked down and resolved ?
>
> Nick
>

>
> On Wed, Oct 4, 2017 at 10:06 PM Sean Owen <so...@cloudera.com> wrote:
>
>> We have this problem occasionally, where a disgruntled user continually
>> reopens an issue after it's closed.
>>
>> https://issues.apache.org/jira/browse/SPARK-21999
>>
>> (Feel free to comment on this one if anyone disagrees)
>>
>> Regardless of that particular JIRA, I'd like to disable to Closed ->
>> Reopened transition for non-committers:
>> https://issues.apache.org/jira/browse/INFRA-15221
>>
>>


[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-09-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168038#comment-16168038
 ] 

Nicholas Chammas commented on SPARK-17025:
--

I take that back. I won't be able to test this for the time being. If someone 
else wants to test this out and needs pointers, I'd be happy to help.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Run a specific PySpark test or group of tests

2017-08-16 Thread Nicholas Chammas
Looks like it doesn’t take too much work to get pytest working on our code
base, since it knows how to run unittest tests.

https://github.com/apache/spark/compare/master...nchammas:pytest

For example, I was able to do this from that branch and it did the right
thing, running only the tests with “string” in their name:

python [pytest *]$ ../bin/spark-submit ./pytest-run-tests.py
./pyspark/sql/tests.py -v -k string
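
(For context on why no rewrites are needed: pytest collects plain
unittest.TestCase classes as-is, so a hypothetical test like the sketch below
gets matched by -k just like a native pytest test.)

import unittest

class StringFunctionTests(unittest.TestCase):
    # "-k string" matches this test by its name; no pytest-specific code needed.
    def test_string_upper(self):
        self.assertEqual("spark".upper(), "SPARK")

if __name__ == "__main__":
    unittest.main()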

However, looking more closely at the whole test setup, I’m hesitant to work
any further on this.

My intention was to see if we could leverage pytest, tox, and other test
tools that are standard in the Python ecosystem to replace some of the
homegrown stuff we have. We have our own test dependency tracking code, our
own breakdown of tests into module-scoped chunks, and our own machinery to
parallelize test execution. It seems like it would be a lot of work to reap
the benefits of using the standard tools while ensuring that we don’t lose
any of the benefits our current test setup provides.

Nick

On Tue, Aug 15, 2017 at 3:26 PM Bryan Cutler <cutl...@gmail.com> wrote:

This generally works for me to just run tests within a class or even a
> single test.  Not as flexible as pytest -k, which would be nice..
>
> $ SPARK_TESTING=1 bin/pyspark pyspark.sql.tests ArrowTests
> On Tue, Aug 15, 2017 at 5:49 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Pytest does support unittest-based tests
>> <https://docs.pytest.org/en/latest/unittest.html>, allowing for
>> incremental adoption. I'll see how convenient it is to use with our current
>> test layout.
>>
>> On Tue, Aug 15, 2017 at 1:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> For me, I would like this if this can be done with relatively small
>>> changes.
>>> How about adding more granular options, for example, specifying or
>>> filtering smaller set of test goals in the run-tests.py script?
>>> I think it'd be quite small change and we could roughly reach this goal
>>> if I understood correctly.
>>>
>>>
>>> 2017-08-15 3:06 GMT+09:00 Nicholas Chammas <nicholas.cham...@gmail.com>:
>>>
>>>> Say you’re working on something and you want to rerun the PySpark
>>>> tests, focusing on a specific test or group of tests. Is there a way to do
>>>> that?
>>>>
>>>> I know that you can test entire modules with this:
>>>>
>>>> ./python/run-tests --modules pyspark-sql
>>>>
>>>> But I’m looking for something more granular, like pytest’s -k option.
>>>>
>>>> On that note, does anyone else think it would be valuable to use a test
>>>> runner like pytest to run our Python tests? The biggest benefits would be
>>>> the use of fixtures <https://docs.pytest.org/en/latest/fixture.html>,
>>>> and more flexibility on test running and reporting. Just wondering if we’ve
>>>> already considered this.
>>>>
>>>> Nick
>>>> ​
>>>>
>>>
>>> ​


[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-08-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128271#comment-16128271
 ] 

Nicholas Chammas commented on SPARK-17025:
--

I'm still interested in this but I won't be able to test it until mid-next 
month, unfortunately. I've set myself a reminder to revisit this.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Run a specific PySpark test or group of tests

2017-08-15 Thread Nicholas Chammas
Pytest does support unittest-based tests
<https://docs.pytest.org/en/latest/unittest.html>, allowing for incremental
adoption. I'll see how convenient it is to use with our current test layout.

On Tue, Aug 15, 2017 at 1:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> For me, I would like this if this can be done with relatively small
> changes.
> How about adding more granular options, for example, specifying or
> filtering smaller set of test goals in the run-tests.py script?
> I think it'd be quite small change and we could roughly reach this goal if
> I understood correctly.
>
>
> 2017-08-15 3:06 GMT+09:00 Nicholas Chammas <nicholas.cham...@gmail.com>:
>
>> Say you’re working on something and you want to rerun the PySpark tests,
>> focusing on a specific test or group of tests. Is there a way to do that?
>>
>> I know that you can test entire modules with this:
>>
>> ./python/run-tests --modules pyspark-sql
>>
>> But I’m looking for something more granular, like pytest’s -k option.
>>
>> On that note, does anyone else think it would be valuable to use a test
>> runner like pytest to run our Python tests? The biggest benefits would be
>> the use of fixtures <https://docs.pytest.org/en/latest/fixture.html>,
>> and more flexibility on test running and reporting. Just wondering if we’ve
>> already considered this.
>>
>> Nick
>> ​
>>
>
>


Run a specific PySpark test or group of tests

2017-08-14 Thread Nicholas Chammas
Say you’re working on something and you want to rerun the PySpark tests,
focusing on a specific test or group of tests. Is there a way to do that?

I know that you can test entire modules with this:

./python/run-tests --modules pyspark-sql

But I’m looking for something more granular, like pytest’s -k option.

On that note, does anyone else think it would be valuable to use a test
runner like pytest to run our Python tests? The biggest benefits would be
the use of fixtures , and
more flexibility on test running and reporting. Just wondering if we’ve
already considered this.
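
To make the fixtures point concrete, here is a rough sketch (hypothetical, not
proposed test code) of a session-scoped SparkSession fixture in a conftest.py:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One SparkSession shared across the whole test session, torn down at the end.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pytest-fixture-sketch")
        .getOrCreate()
    )
    yield session
    session.stop()

def test_range_count(spark):
    assert spark.range(10).count() == 10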

Nick
​


[jira] [Created] (SPARK-21712) Clarify PySpark Column.substr() type checking error message

2017-08-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-21712:


 Summary: Clarify PySpark Column.substr() type checking error 
message
 Key: SPARK-21712
 URL: https://issues.apache.org/jira/browse/SPARK-21712
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Nicholas Chammas
Priority: Trivial


https://github.com/apache/spark/blob/f0169a1c6a1ac06045d57f8aaa2c841bb39e23ac/python/pyspark/sql/column.py#L408-L409

"Can not mix the type" is really unclear.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nicholas Chammas
Here’s a repro for a very similar issue where Spark hangs on the UDF, which
I think is related to the SPARK_HOME issue. I posted the repro on the EMR
forum, but in case you can’t access it:

   1. I’m running EMR 5.6.0, Spark 2.1.1, and Python 3.5.1.
   2. Create a simple Python package by creating a directory called udftest.
   3. Inside udftest put an empty __init__.py and a nothing.py.
   4.

   nothing.py should have the following contents:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def do_nothing(s: int) -> int:
    return s

do_nothing_udf = udf(do_nothing, IntegerType())

   5.

   From your home directory (the one that contains your udftest package),
   create a ZIP that we will ship to YARN.

pushd udftest/
zip -rq ../udftest.zip *
popd

   6.

   Start a PySpark shell with our test package.

export PYSPARK_PYTHON=python3
pyspark \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYSPARK_PYTHON" \
  --archives "udftest.zip#udftest"

   7.

   Now try to use the UDF. It will hang.

from udftest.nothing import do_nothing_udf
spark.range(10).select(do_nothing_udf('id')).show()  # hangs

   8.

   The strange thing is, if you define the exact same UDF directly in the
   active PySpark shell, it works fine! It’s only when you import it from a
   user-defined module that you see this issue. (A possible workaround is
   sketched just after this list.)

​
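
A possible workaround, assuming the hang really does come from the udf() call
running at module import time on the executors (that is my guess, not something
I have confirmed), is to defer wrapping the function until driver code asks for
it:

# udftest/nothing.py, reworked as a hypothetical workaround sketch:
# keep only the plain Python function at module level so that importing
# this module on an executor never touches the SparkContext.
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def do_nothing(s: int) -> int:
    return s

def make_do_nothing_udf():
    # Called from driver code, after the SparkSession/SparkContext exists.
    return udf(do_nothing, IntegerType())

On the driver you would then do do_nothing_udf = make_do_nothing_udf() and use
it exactly as before.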

On Thu, Jun 22, 2017 at 12:08 PM Nick Chammas 
wrote:

> I’m seeing a strange issue on EMR which I posted about here.
>
> In brief, when I try to import a UDF I’ve defined, Python somehow fails to
> find Spark. This exact code works for me locally and works on our
> on-premises CDH cluster under YARN.
>
> This is the traceback:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 318, in show
> print(self._jdf.showString(n, 20))
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o89.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, ip-10-97-35-12.ec2.internal, executor 1): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 161, in main
> func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 91, in read_udfs
> _, udf = read_single_udf(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 78, in read_single_udf
> f, return_type = read_command(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 54, in read_command
> command = serializer._read_with_length(file)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/serializers.py",
>  line 169, in _read_with_length
> return self.loads(obj)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/serializers.py",
>  line 451, in loads
> return pickle.loads(obj, encoding=encoding)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/splinkr/person.py",
>  line 7, in <module>
> from splinkr.util import repartition_to_size
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/splinkr/util.py",
>  line 34, in <module>
> containsNull=False,
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
>  line 1872, in udf
> return UserDefinedFunction(f, returnType)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
>  line 1830, in __init__
> self._judf = 

[jira] [Commented] (SPARK-21110) Structs should be usable in inequality filters

2017-06-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059536#comment-16059536
 ] 

Nicholas Chammas commented on SPARK-21110:
--

cc [~marmbrus] - Assuming this is a valid feature request, maybe it belongs as 
part of a larger umbrella task.

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct<city:string,person:string>;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21110) Structs should be usable in inequality filters

2017-06-15 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-21110:
-
Summary: Structs should be usable in inequality filters  (was: Structs 
should be orderable)

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct<city:string,person:string>;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21110) Structs should be orderable

2017-06-15 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-21110:


 Summary: Structs should be orderable
 Key: SPARK-21110
 URL: https://issues.apache.org/jira/browse/SPARK-21110
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Nicholas Chammas
Priority: Minor


It seems like a missing feature that you can't compare structs in a filter on a 
DataFrame.

Here's a simple demonstration of a) where this would be useful and b) how it's 
different from simply comparing each of the components of the structs.

{code}
import pyspark
from pyspark.sql.functions import col, struct, concat

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
[
('Boston', 'Bob'),
('Boston', 'Nick'),
('San Francisco', 'Bob'),
('San Francisco', 'Nick'),
],
['city', 'person']
)
pairs = (
df.select(
struct('city', 'person').alias('p1')
)
.crossJoin(
df.select(
struct('city', 'person').alias('p2')
)
)
)

print("Everything")
pairs.show()

print("Comparing parts separately (doesn't give me what I want)")
(pairs
.where(col('p1.city') < col('p2.city'))
.where(col('p1.person') < col('p2.person'))
.show())

print("Comparing parts together with concat (gives me what I want but is 
hacky)")
(pairs
.where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
.show())

print("Comparing parts together with struct (my desired solution but currently 
yields an error)")
(pairs
.where(col('p1') < col('p2'))
.show())
{code}

The last query yields the following error in Spark 2.1.1:

{code}
org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint or 
int or bigint or float or double or decimal or timestamp or date or string or 
binary) type, not struct<city:string,person:string>;;
'Filter (p1#5 < p2#8)
+- Join Cross
   :- Project [named_struct(city, city#0, person, person#1) AS p1#5]
   :  +- LogicalRDD [city#0, person#1]
   +- Project [named_struct(city, city#0, person, person#1) AS p2#8]
  +- LogicalRDD [city#0, person#1]
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2017-06-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035062#comment-16035062
 ] 

Nicholas Chammas commented on SPARK-12661:
--

I think we are good to resolve this provided that we've stopped testing with 
Python 2.6. Any cleanup of 2.6-specific workarounds (tracked in SPARK-20149) 
can be done separately IMO.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9862) Join: Handling data skew

2017-05-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020030#comment-16020030
 ] 

Nicholas Chammas commented on SPARK-9862:
-

Is this issue meant to be a SQL-equivalent of SPARK-4644 (as suggested by 
[~rxin] in the comments)?

> Join: Handling data skew
> 
>
> Key: SPARK-9862
> URL: https://issues.apache.org/jira/browse/SPARK-9862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
> Attachments: Handling skew data in join.pdf
>
>
> For a two way shuffle join, if one or multiple groups are skewed in one table 
> (say left table) but having a relative small number of rows in another table 
> (say right table), we can use broadcast join for these skewed groups and use 
> shuffle join for other groups.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-20 Thread Nicholas Chammas
Steve,

I think you're a good person to ask about this. Is the below any cause for
concern? Or did I perhaps test this incorrectly?

Nick


On Tue, Apr 18, 2017 at 11:50 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I had trouble starting up a shell with the AWS package loaded
> (specifically, org.apache.hadoop:hadoop-aws:2.7.3):
>
>
> [NOT FOUND  ] 
> com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
>
>  local-m2-cache: tried
>
>   
> file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
>
> [NOT FOUND  ] 
> org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
>
>  local-m2-cache: tried
>
>   
> file:/home/ec2-user/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar
>
> [NOT FOUND  ] 
> com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar (0ms)
>
>  local-m2-cache: tried
>
>   
> file:/home/ec2-user/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar
>
> ::
>
> ::  FAILED DOWNLOADS::
>
> :: ^ see resolution messages for details  ^ ::
>
> ::
>
> :: com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle)
>
> :: org.codehaus.jettison#jettison;1.1!jettison.jar(bundle)
>
> :: com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar
>
> :: com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle)
>
> ::
>
> Anyone know anything about this? I made sure to build Spark against the
> appropriate version of Hadoop.
>
> Nick
>
> On Tue, Apr 18, 2017 at 2:59 PM Michael Armbrust <mich...@databricks.com>
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.1-rc3
>> <https://github.com/apache/spark/tree/v2.1.1-rc3> (
>> 2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1230/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.0.
>>
>> *What happened to RC1?*
>>
>> There were issues with the release packaging and as a result it was skipped.
>>
> ​
>


Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-18 Thread Nicholas Chammas
I had trouble starting up a shell with the AWS package loaded
(specifically, org.apache.hadoop:hadoop-aws:2.7.3):


[NOT FOUND  ]
com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)

 local-m2-cache: tried

  
file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar

[NOT FOUND  ]
org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)

 local-m2-cache: tried

  
file:/home/ec2-user/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar

[NOT FOUND  ]
com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar (0ms)

 local-m2-cache: tried

  
file:/home/ec2-user/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar

::

::  FAILED DOWNLOADS::

:: ^ see resolution messages for details  ^ ::

::

:: com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle)

:: org.codehaus.jettison#jettison;1.1!jettison.jar(bundle)

:: com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar

:: com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle)

::

Anyone know anything about this? I made sure to build Spark against the
appropriate version of Hadoop.

Nick

On Tue, Apr 18, 2017 at 2:59 PM Michael Armbrust 
wrote:

Please vote on releasing the following candidate as Apache Spark version
> 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc3
>  (
> 2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1230/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.1?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.0.
>
> *What happened to RC1?*
>
> There were issues with the release packaging and as a result it was skipped.
>
​


Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Ah, that's why all the stuff about scheduler pools is under the
section "Scheduling
Within an Application
<https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application>".
I am so used to talking to my coworkers about jobs in the sense of
applications that I forgot your typical Spark application submits multiple
"jobs", each of which has multiple stages, etc.

So in my case I need to read up more closely about YARN queues
<https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>
since I want to share resources *across* applications. Thanks Mark!
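
For my own notes, a minimal sketch of the within-application case (assuming
FAIR mode is enabled and a pool named "production" is defined in the file
pointed to by spark.scheduler.allocation.file):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pool-demo")
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()
)
sc = spark.sparkContext

# Jobs launched from this thread after this call land in the "production" pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
spark.range(10**6).count()

# Revert to the default pool for later jobs from this thread.
sc.setLocalProperty("spark.scheduler.pool", None)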

On Wed, Apr 5, 2017 at 4:31 PM Mark Hamstra <m...@clearstorydata.com> wrote:

> `spark-submit` creates a new Application that will need to get resources
> from YARN. Spark's scheduler pools will determine how those resources are
> allocated among whatever Jobs run within the new Application.
>
> Spark's scheduler pools are only relevant when you are submitting multiple
> Jobs within a single Application (i.e., you are using the same SparkContext
> to launch multiple Jobs) and you have used SparkContext#setLocalProperty to
> set "spark.scheduler.pool" to something other than the default pool before
> a particular Job intended to use that pool is started via that SparkContext.
>
> On Wed, Apr 5, 2017 at 1:11 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> Hmm, so when I submit an application with `spark-submit`, I need to
> guarantee it resources using YARN queues and not Spark's scheduler pools.
> Is that correct?
>
> When are Spark's scheduler pools relevant/useful in this context?
>
> On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
> grrr... s/your/you're/
>
> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jobs.
>
> On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> I'm having trouble understanding the difference between Spark fair
> scheduler pools
> <https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools>
> and YARN queues
> <https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>.
> Do they conflict? Does one override the other?
>
> I posted a more detailed question about an issue I'm having with this on
> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>
> Nick
>
>
> --
> View this message in context: Spark fair scheduler pools vs. YARN queues
> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-fair-scheduler-pools-vs-YARN-queues-tp28572.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
>
>
>
>


Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Hmm, so when I submit an application with `spark-submit`, I need to
guarantee it resources using YARN queues and not Spark's scheduler pools.
Is that correct?

When are Spark's scheduler pools relevant/useful in this context?

On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra  wrote:

> grrr... s/your/you're/
>
> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra 
> wrote:
>
> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jobs.
>
> On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas 
> wrote:
>
> I'm having trouble understanding the difference between Spark fair
> scheduler pools
> 
> and YARN queues
> .
> Do they conflict? Does one override the other?
>
> I posted a more detailed question about an issue I'm having with this on
> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>
> Nick
>
>
> --
> View this message in context: Spark fair scheduler pools vs. YARN queues
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>
>
>
>


[jira] [Comment Edited] (SPARK-19553) Add GroupedData.countApprox()

2017-03-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870780#comment-15870780
 ] 

Nicholas Chammas edited comment on SPARK-19553 at 3/14/17 2:38 PM:
---

The utility of 1) would be the ability to count items instead of distinct items,
unless I misunderstood what you're saying. I would imagine that just counting 
items (as opposed to distinct items) would be cheaper, in addition to being 
semantically different.

-I'll open a PR for 3), unless someone else wants to step in and do that.-


was (Author: nchammas):
The utility of 1) would be being able to count items instead of distinct items, 
unless I misunderstood what you're saying. I would imagine that just counting 
items (as opposed to distinct items) would be cheaper, in addition to being 
semantically different.

I'll open a PR for 3), unless someone else wants to step in and do that.

> Add GroupedData.countApprox()
> -
>
> Key: SPARK-19553
> URL: https://issues.apache.org/jira/browse/SPARK-19553
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> We already have a 
> [{{pyspark.sql.functions.approx_count_distinct()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approx_count_distinct]
>  that can be applied to grouped data, but it seems odd that you can't just 
> get regular approximate count for grouped data.
> I imagine the API would mirror that for 
> [{{RDD.countApprox()}}|http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.countApprox],
>  but I'm not sure:
> {code}
> (df
> .groupBy('col1')
> .countApprox(timeout=300, confidence=0.95)
> .show())
> {code}
> Or, if we want to mirror the {{approx_count_distinct()}} function, we can do 
> that too. I'd want to understand why that function doesn't take a timeout or 
> confidence parameter, though. Also, what does {{rsd}} mean? It's not 
> documented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Nicholas Chammas
Since GraphFrames is not part of the Spark project, your
GraphFrames-specific questions are probably better directed at the
GraphFrames issue tracker:

https://github.com/graphframes/graphframes/issues

As far as I know, GraphFrames is an active project, though not as active as
Spark of course. There will be lulls in development since the people
driving that project forward also have major commitments to other projects.
This is natural.

If you post on GitHub, I would wager someone there (maybe Joseph or Tim
<https://github.com/graphframes/graphframes/graphs/contributors>?) will
be able to answer your questions about GraphFrames.


   1. The page you linked refers to a *plan* to move GraphFrames to the
   standard Spark release cycle. Is this *plan* publicly available /
   visible?

I didn’t see any such reference to a plan in the page I linked you to.
Rather, the page says <http://graphframes.github.io/#what-are-graphframes>:

The current plan is to keep GraphFrames separate from core Apache Spark for
the time being.

Nick
​

On Mon, Mar 13, 2017 at 5:46 PM enzo <e...@smartinsightsfromdata.com> wrote:

> Nick
>
> Thanks for the quick answer :)
>
> Sadly, the comment in the page doesn’t answer my questions. More
> specifically:
>
> 1. GraphFrames' last activity on GitHub was 2 months ago.  Last release on 12
> Nov 2016.  Until recently, 2 months was close to a Spark release cycle.  Why
> has there been no major development since mid-November?
>
> 2. The page you linked refers to a *plan* to move GraphFrames to the
> standard Spark release cycle.  Is this *plan* publicly available / visible?
>
> 3. I couldn’t find any statement of intent to preserve either one or the
> other APIs, or just merge them: in other words, there seem to be no
> overarching plan for a cohesive & comprehensive graph API (I apologise in
> advance if I’m wrong).
>
> 4. I was initially impressed by GraphFrames syntax in places similar to
> Neo4J Cypher (now open source), but later I understood was an incomplete
> lightweight experiment (with no intention to move to full compatibility,
> perhaps for good reasons).  To me it sort of gave the wrong message.
>
> 5. In the mean time the world of graphs is changing. GraphBlas forum seems
> to make some traction: a library based on GraphBlas has been made available
> on Accumulo (Graphulo).  Assuming that Spark is NOT going to adopt similar
> lines, nor to follow DataStax with TinkerPop and Gremlin, again, what is
> the new,  cohesive & comprehensive API that Spark is going to deliver?
>
>
> Sadly, the API uncertainty may force developers to more stable kind of API
> / platforms & roadmaps.
>
>
>
> Thanks Enzo
>
> On 13 Mar 2017, at 22:09, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> Your question is answered here under "Will GraphFrames be part of Apache
> Spark?", no?
>
> http://graphframes.github.io/#what-are-graphframes
>
> Nick
>
> On Mon, Mar 13, 2017 at 4:56 PM enzo <e...@smartinsightsfromdata.com>
> wrote:
>
> Please see this email  trail:  no answer so far on the user@spark board.
> Trying the developer board for better luck
>
> The question:
>
> I am a bit confused by the current roadmap for graph and graph analytics
> in Apache Spark.
>
> I understand that we have had for some time two libraries (the following
> is my understanding - please amend as appropriate!):
>
> . GraphX, part of Spark project.  This library is based on RDD and it is
> only accessible via Scala.  It doesn’t look that this library has been
> enhanced recently.
> . GraphFrames, independent (at the moment?) library for Spark.  This
> library is based on Spark DataFrames and accessible by Scala & Python. Last
> commit on GitHub was 2 months ago.
>
> GraphFrames came about with the promise of at some point being integrated into
> Apache Spark.
>
> I can see other projects coming up with interesting libraries and ideas
> (e.g. Graphulo on Accumulo, a new project with the goal of implementing
> the GraphBlas building blocks for graph algorithms on top of Accumulo).
>
> Where is Apache Spark going?
>
> Where are graph libraries in the roadmap?
>
>
>
> Thanks for any clarity brought to this matter.
>
> Thanks Enzo
>
> Begin forwarded message:
>
> *From: *"Md. Rezaul Karim" <rezaul.ka...@insight-centre.org>
> *Subject: **Re: Question on Spark's graph libraries*
> *Date: *10 March 2017 at 13:13:15 CET
> *To: *Robin East <robin.e...@xense.co.uk>
> *Cc: *enzo <e...@smartinsightsfromdata.com>, spark users <
> u...@spark.apache.org>
>
> +1
>
> Regards,
> _

Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Nicholas Chammas
Your question is answered here under "Will GraphFrames be part of Apache
Spark?", no?

http://graphframes.github.io/#what-are-graphframes

Nick

On Mon, Mar 13, 2017 at 4:56 PM enzo  wrote:

> Please see this email  trail:  no answer so far on the user@spark board.
> Trying the developer board for better luck
>
> The question:
>
> I am a bit confused by the current roadmap for graph and graph analytics
> in Apache Spark.
>
> I understand that we have had for some time two libraries (the following
> is my understanding - please amend as appropriate!):
>
> . GraphX, part of Spark project.  This library is based on RDD and it is
> only accessible via Scala.  It doesn’t look that this library has been
> enhanced recently.
> . GraphFrames, independent (at the moment?) library for Spark.  This
> library is based on Spark DataFrames and accessible by Scala & Python. Last
> commit on GitHub was 2 months ago.
>
> GraphFrames came about with the promise of at some point being integrated into
> Apache Spark.
>
> I can see other projects coming up with interesting libraries and ideas
> (e.g. Graphulo on Accumulo, a new project with the goal of implementing
> the GraphBlas building blocks for graph algorithms on top of Accumulo).
>
> Where is Apache Spark going?
>
> Where are graph libraries in the roadmap?
>
>
>
> Thanks for any clarity brought to this matter.
>
> Thanks Enzo
>
> Begin forwarded message:
>
> *From: *"Md. Rezaul Karim" 
> *Subject: **Re: Question on Spark's graph libraries*
> *Date: *10 March 2017 at 13:13:15 CET
> *To: *Robin East 
> *Cc: *enzo , spark users <
> u...@spark.apache.org>
>
> +1
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 10 March 2017 at 12:10, Robin East  wrote:
>
> I would love to know the answer to that too.
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 9 Mar 2017, at 17:42, enzo  wrote:
>
> I am a bit confused by the current roadmap for graph and graph analytics
> in Apache Spark.
>
> I understand that we have had for some time two libraries (the following
> is my understanding - please amend as appropriate!):
>
> . GraphX, part of Spark project.  This library is based on RDD and it is
> only accessible via Scala.  It doesn’t look that this library has been
> enhanced recently.
> . GraphFrames, independent (at the moment?) library for Spark.  This
> library is based on Spark DataFrames and accessible by Scala & Python. Last
> commit on GitHub was 2 months ago.
>
> GraphFrames came about with the promise of at some point being integrated into
> Apache Spark.
>
> I can see other projects coming up with interesting libraries and ideas
> (e.g. Graphulo on Accumulo, a new project with the goal of implementing
> the GraphBlas building blocks for graph algorithms on top of Accumulo).
>
> Where is Apache Spark going?
>
> Where are graph libraries in the roadmap?
>
>
>
> Thanks for any clarity brought to this matter.
>
> Enzo
>
>
>
>
>


[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893703#comment-15893703
 ] 

Nicholas Chammas commented on SPARK-15474:
--

cc [~owen.omalley]

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, any writer is not initialised and created. (no calls of 
> {{WriterContainer.writeRows()}}.
> For Parquet and JSON, it works but ORC does not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890930#comment-15890930
 ] 

Nicholas Chammas commented on SPARK-19578:
--

Makes sense to me. I suppose the Apache Arrow integration work that is 
currently ongoing (SPARK-13534) will not help RDD.count() since that will only 
benefit DataFrames. (Granted, in this specific example you can always read the 
file using spark.read.csv() or spark.read.text() which will avoid this problem.)

So it sounds like the "poor PySpark performance" part of this issue is 
"Won't/Can't fix" at this time. The incorrect UI input-size metrics sounds like 
a separate issue that should be split out. 
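
For reference, a rough sketch of the DataFrame-based alternative (assuming Spark 2.x 
with a SparkSession named {{spark}} and the file from the reproduction):

{code}
# Counting through the DataFrame reader keeps the scan and the count on the
# JVM side, avoiding the Python (de)serialization that makes RDD.count() slow.
df = spark.read.text("/tmp/yes.txt")
print(df.count())
{code}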

> Poor pyspark performance + incorrect UI input-size metrics
> --
>
> Key: SPARK-19578
> URL: https://issues.apache.org/jira/browse/SPARK-19578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 1.6.1, 1.6.2, 2.0.1
> Environment: Spark 1.6.2 Hortonworks
> Spark 2.0.1 MapR
> Spark 1.6.1 MapR
>Reporter: Artur Sukhenko
> Attachments: pyspark_incorrect_inputsize.png, reproduce_log, 
> spark_shell_correct_inputsize.png
>
>
> Simple job in pyspark takes 14 minutes to complete.
> The text file used to reproduce contains several million lines of one word 
> "yes"
>  (it might be the cause of poor performance)
> {code}
> a = sc.textFile("/tmp/yes.txt")
> a.count()
> {code}
> Same code took 33 sec in spark-shell
> Reproduce steps:
> Run this to generate a big file (press Ctrl+C after 5-6 seconds)
> [spark@c6401 ~]$ yes > /tmp/yes.txt
> [spark@c6401 ~]$ ll /tmp/
> -rw-r--r--  1 spark hadoop  516079616 Feb 13 11:10 yes.txt
> [spark@c6401 ~]$ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> [spark@c6401 ~]$ pyspark
> {code}
> Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
> 17/02/13 11:12:36 INFO SparkContext: Running Spark version 1.6.2
> {code}
> >>> a = sc.textFile("/tmp/yes.txt")
> {code}
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 341.1 KB, free 341.1 KB)
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 28.3 KB, free 369.4 KB)
> 17/02/13 11:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43389 (size: 28.3 KB, free: 517.4 MB)
> 17/02/13 11:12:58 INFO SparkContext: Created broadcast 0 from textFile at 
> NativeMethodAccessorImpl.java:-2
> {code}
> >>> a.count()
> {code}
> 17/02/13 11:13:03 INFO FileInputFormat: Total input paths to process : 1
> 17/02/13 11:13:03 INFO SparkContext: Starting job: count at :1
> 17/02/13 11:13:03 INFO DAGScheduler: Got job 0 (count at :1) with 4 
> output partitions
> 17/02/13 11:13:03 INFO DAGScheduler: Final stage: ResultStage 0 (count at 
> :1)
> 17/02/13 11:13:03 INFO DAGScheduler: Parents of final stage: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Missing parents: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] 
> at count at :1), which has no missing parents
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 5.7 KB, free 375.1 KB)
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
> in memory (estimated size 3.5 KB, free 378.6 KB)
> 17/02/13 11:13:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on localhost:43389 (size: 3.5 KB, free: 517.4 MB)
> 17/02/13 11:13:03 INFO SparkContext: Created broadcast 1 from broadcast at 
> DAGScheduler.scala:1008
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting 4 missing tasks from 
> ResultStage 0 (PythonRDD[2] at count at :1)
> 17/02/13 11:13:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
> 17/02/13 11:13:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, partition 0,ANY, 2149 bytes)
> 17/02/13 11:13:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 17/02/13 11:13:03 INFO HadoopRDD: Input split: 
> hdfs://c6401.ambari.apache.org:8020/tmp/yes.txt:0+134217728
> 17/02/13 11:13:03 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
> mapreduce.task.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.id is deprecated. Instead, 
> use mapreduce.task.attempt.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.is.map is deprecated. 
> Instead, use mapreduce.task.ismap
> 17/02/13 11:13:03 INFO deprecation: mapred.task.

[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890639#comment-15890639
 ] 

Nicholas Chammas commented on SPARK-15474:
--

There is a related discussion on ORC-152 which suggests that this is an issue 
with Spark's DataFrame writer for ORC. If there is evidence that this is not 
the case, it would be good to post it directly on ORC-152 so we can get input 
from people on that project.
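
In the meantime, a possible workaround (an untested sketch, assuming Spark 2.x / 
PySpark, and a made-up output path) is to do what the error message suggests and 
supply the schema by hand on read:

{code}
from pyspark.sql.types import LongType, StructField, StructType

empty_df = spark.range(10).limit(0)
empty_df.write.mode("overwrite").format("orc").save("/tmp/empty_orc")

# Passing the schema explicitly skips the schema inference step that fails
# when the output directory contains no ORC files.
schema = StructType([StructField("id", LongType(), nullable=False)])
spark.read.schema(schema).format("orc").load("/tmp/empty_orc").show()
{code}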

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case from the one with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls of 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON, it works but ORC does not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890588#comment-15890588
 ] 

Nicholas Chammas commented on SPARK-19578:
--

[~holdenk] - Would it make sense to have PySpark's {{RDD.count()}} simply 
delegate to the underlying {{_jrdd}}? I'm not seeing why Python needs to be 
involved at all to return a count.
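
To make the idea concrete, here is a rough, illustrative sketch (not a patch) of 
what that delegation would look like for an RDD read straight from a text file:

{code}
# Illustrative only. For an RDD created by sc.textFile(), _jrdd is a JavaRDD of
# strings, so its count matches the Python-level count. After Python-side
# transformations, _jrdd holds pickled (and batched) records, which is the
# complication for a blanket delegation.
a = sc.textFile("/tmp/yes.txt")
jvm_side_count = a._jrdd.count()
{code}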

> Poor pyspark performance + incorrect UI input-size metrics
> --
>
> Key: SPARK-19578
> URL: https://issues.apache.org/jira/browse/SPARK-19578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 1.6.1, 1.6.2, 2.0.1
> Environment: Spark 1.6.2 Hortonworks
> Spark 2.0.1 MapR
> Spark 1.6.1 MapR
>Reporter: Artur Sukhenko
> Attachments: pyspark_incorrect_inputsize.png, reproduce_log, 
> spark_shell_correct_inputsize.png
>
>
> Simple job in pyspark takes 14 minutes to complete.
> The text file used to reproduce contains several million lines of one word 
> "yes"
>  (it might be the cause of poor performance)
> {code}
> a = sc.textFile("/tmp/yes.txt")
> a.count()
> {code}
> Same code took 33 sec in spark-shell
> Reproduce steps:
> Run this to generate a big file (press Ctrl+C after 5-6 seconds)
> [spark@c6401 ~]$ yes > /tmp/yes.txt
> [spark@c6401 ~]$ ll /tmp/
> -rw-r--r--  1 spark hadoop  516079616 Feb 13 11:10 yes.txt
> [spark@c6401 ~]$ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> [spark@c6401 ~]$ pyspark
> {code}
> Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
> 17/02/13 11:12:36 INFO SparkContext: Running Spark version 1.6.2
> {code}
> >>> a = sc.textFile("/tmp/yes.txt")
> {code}
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 341.1 KB, free 341.1 KB)
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 28.3 KB, free 369.4 KB)
> 17/02/13 11:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43389 (size: 28.3 KB, free: 517.4 MB)
> 17/02/13 11:12:58 INFO SparkContext: Created broadcast 0 from textFile at 
> NativeMethodAccessorImpl.java:-2
> {code}
> >>> a.count()
> {code}
> 17/02/13 11:13:03 INFO FileInputFormat: Total input paths to process : 1
> 17/02/13 11:13:03 INFO SparkContext: Starting job: count at :1
> 17/02/13 11:13:03 INFO DAGScheduler: Got job 0 (count at :1) with 4 
> output partitions
> 17/02/13 11:13:03 INFO DAGScheduler: Final stage: ResultStage 0 (count at 
> :1)
> 17/02/13 11:13:03 INFO DAGScheduler: Parents of final stage: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Missing parents: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] 
> at count at :1), which has no missing parents
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 5.7 KB, free 375.1 KB)
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
> in memory (estimated size 3.5 KB, free 378.6 KB)
> 17/02/13 11:13:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on localhost:43389 (size: 3.5 KB, free: 517.4 MB)
> 17/02/13 11:13:03 INFO SparkContext: Created broadcast 1 from broadcast at 
> DAGScheduler.scala:1008
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting 4 missing tasks from 
> ResultStage 0 (PythonRDD[2] at count at :1)
> 17/02/13 11:13:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
> 17/02/13 11:13:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, partition 0,ANY, 2149 bytes)
> 17/02/13 11:13:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 17/02/13 11:13:03 INFO HadoopRDD: Input split: 
> hdfs://c6401.ambari.apache.org:8020/tmp/yes.txt:0+134217728
> 17/02/13 11:13:03 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
> mapreduce.task.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.id is deprecated. Instead, 
> use mapreduce.task.attempt.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.is.map is deprecated. 
> Instead, use mapreduce.task.ismap
> 17/02/13 11:13:03 INFO deprecation: mapred.task.partition is deprecated. 
> Instead, use mapreduce.task.partition
> 17/02/13 11:13:03 INFO deprecation: mapred.job.id is deprecated. Instead, use 
> mapreduce.job.id
> 17/02/13 11:16:37 INFO PythonRunner: Times: total = 213335, boot = 317, init 
> = 445, finish = 212573
> 17/02/13 11:16:37 INFO Executor: Finished task 0

[jira] [Commented] (SPARK-18381) Wrong date conversion between spark and python for dates before 1583

2017-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888776#comment-15888776
 ] 

Nicholas Chammas commented on SPARK-18381:
--

Oh, and to provide additional information on why anyone would care about these 
kinds of dates: We work with a variety of external datasources that often use 
"0001-01-01" as a sentinel value representing an empty or unknown date.

> Wrong date conversion between spark and python for dates before 1583
> 
>
> Key: SPARK-18381
> URL: https://issues.apache.org/jira/browse/SPARK-18381
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Luca Caniparoli
>
> Dates before 1583 (the Julian/Gregorian calendar transition) are processed 
> incorrectly. 
> * With a Python UDF (datetime.strptime), .show() returns wrong dates but 
> .collect() returns correct dates
> * With pyspark.sql.functions.to_date, .show() shows correct dates but 
> .collect() returns wrong dates. Additionally, collecting '0001-01-01' raises an 
> error when collecting the DataFrame. 
> {code:none}
> from pyspark.sql.types import DateType
> from pyspark.sql.functions import to_date, udf
> from datetime import datetime
> strToDate =  udf (lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
> l = [('0002-01-01', 1), ('1581-01-01', 2), ('1582-01-01', 3), ('1583-01-01', 
> 4), ('1584-01-01', 5), ('2012-01-21', 6)]
> l_older = [('0001-01-01', 1)]
> test_df = spark.createDataFrame(l, ["date_string", "number"])
> test_df_older = spark.createDataFrame(l_older, ["date_string", "number"])
> test_df_strptime = test_df.withColumn( "date_cast", 
> strToDate(test_df["date_string"]))
> test_df_todate = test_df.withColumn( "date_cast", 
> to_date(test_df["date_string"]))
> test_df_older_todate = test_df_older.withColumn( "date_cast", 
> to_date(test_df_older["date_string"]))
> test_df_strptime.show()
> test_df_todate.show()
> print test_df_strptime.collect()
> print test_df_todate.collect()
> print test_df_older_todate.collect()
> {code}
> {noformat}
> +---+--+--+
> |date_string|number| date_cast|
> +---+--+--+
> | 0002-01-01| 1|0002-01-03|
> | 1581-01-01| 2|1580-12-22|
> | 1582-01-01| 3|1581-12-22|
> | 1583-01-01| 4|1583-01-01|
> | 1584-01-01| 5|1584-01-01|
> | 2012-01-21| 6|2012-01-21|
> +---+--+--+
> +---+--+--+
> |date_string|number| date_cast|
> +---+--+--+
> | 0002-01-01| 1|0002-01-01|
> | 1581-01-01| 2|1581-01-01|
> | 1582-01-01| 3|1582-01-01|
> | 1583-01-01| 4|1583-01-01|
> | 1584-01-01| 5|1584-01-01|
> | 2012-01-21| 6|2012-01-21|
> +---+--+--+
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(2, 1, 1)), 
> Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 1, 
> 1)), Row(date_string=u'1582-01-01', number=3, date_cast=datetime.date(1582, 
> 1, 1)), Row(date_string=u'1583-01-01', number=4, 
> date_cast=datetime.date(1583, 1, 1)), Row(date_string=u'1584-01-01', 
> number=5, date_cast=datetime.date(1584, 1, 1)), 
> Row(date_string=u'2012-01-21', number=6, date_cast=datetime.date(2012, 1, 
> 21))]
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(1, 12, 
> 30)), Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 
> 1, 11)), Row(date_string=u'1582-01-01', number=3, 
> date_cast=datetime.date(1582, 1, 11)), Row(date_string=u'1583-01-01', 
> number=4, date_cast=datetime.date(1583, 1, 1)), 
> Row(date_string=u'1584-01-01', number=5, date_cast=datetime.date(1584, 1, 
> 1)), Row(date_string=u'2012-01-21', number=6, date_cast=datetime.date(2012, 
> 1, 21))]
> Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6043517212596195478.py", line 267, in 
> raise Exception(traceback.format_exc())
> Exception: Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6043517212596195478.py", line 265, in 
> exec(code)
>   File "", line 15, in 
>   File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 311, in 
> collect
> return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
>   File "/usr/local/spark/python/pyspark/rdd.py", line 142, in 
> _load_from_socket
> for item in serializer.load_stream(rf):
>   File "/

[jira] [Commented] (SPARK-18381) Wrong date conversion between spark and python for dates before 1583

2017-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888752#comment-15888752
 ] 

Nicholas Chammas commented on SPARK-18381:
--

I am seeing a very similar issue when trying to read some date data from 
Parquet. When I tried to create a minimal repro, I uncovered this error, which 
is probably related to what is reported above:

{code}
>>> spark.createDataFrame([(datetime.date(1, 1, 1),)], ('date',)).show(1)
+--+
|  date|
+--+
|0001-01-03|
+--+
{code}

I'm not sure how Jan 1 became Jan 3, but something's obviously wrong.

Here's another weird example:

{code}
>>> spark.createDataFrame([(datetime.date(1000, 10, 10),)], ('date',)).show(1)
+--+
|  date|
+--+
|1000-10-04|
+--+
{code}

In both cases, accessing the underlying data from the RDD returns the correct 
result:

{code}
>>> spark.createDataFrame([(datetime.date(1, 1, 1),)], ('date',)).take(1)
[Row(date=datetime.date(1, 1, 1))]

>>> spark.createDataFrame([(datetime.date(1000, 10, 10),)], ('date',)).take(1)
[Row(date=datetime.date(1000, 10, 10))]
{code}

I'm guessing there is a problem [somewhere in 
here|https://github.com/apache/spark/blob/9734a928a75d29ea202e9f309f92ca4637d35671/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala]
 that is causing what Luca and I are seeing. Specifically, I am suspicious of 
[these 
lines|https://github.com/apache/spark/blob/9734a928a75d29ea202e9f309f92ca4637d35671/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L173-L185].
There is probably a calendar mismatch at play: the Java date handling used when 
converting a number of days to a date follows a hybrid Julian/Gregorian calendar, 
while Python's datetime follows the proleptic Gregorian calendar, and Spark does 
not appear to reconcile the two.

cc [~davies] [~holdenk]

> Wrong date conversion between spark and python for dates before 1583
> 
>
> Key: SPARK-18381
> URL: https://issues.apache.org/jira/browse/SPARK-18381
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Luca Caniparoli
>
> Dates before 1583 (the Julian/Gregorian calendar transition) are processed 
> incorrectly. 
> * With a Python UDF (datetime.strptime), .show() returns wrong dates but 
> .collect() returns correct dates
> * With pyspark.sql.functions.to_date, .show() shows correct dates but 
> .collect() returns wrong dates. Additionally, collecting '0001-01-01' raises an 
> error when collecting the DataFrame. 
> {code:none}
> from pyspark.sql.types import DateType
> from pyspark.sql.functions import to_date, udf
> from datetime import datetime
> strToDate =  udf (lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
> l = [('0002-01-01', 1), ('1581-01-01', 2), ('1582-01-01', 3), ('1583-01-01', 
> 4), ('1584-01-01', 5), ('2012-01-21', 6)]
> l_older = [('0001-01-01', 1)]
> test_df = spark.createDataFrame(l, ["date_string", "number"])
> test_df_older = spark.createDataFrame(l_older, ["date_string", "number"])
> test_df_strptime = test_df.withColumn( "date_cast", 
> strToDate(test_df["date_string"]))
> test_df_todate = test_df.withColumn( "date_cast", 
> to_date(test_df["date_string"]))
> test_df_older_todate = test_df_older.withColumn( "date_cast", 
> to_date(test_df_older["date_string"]))
> test_df_strptime.show()
> test_df_todate.show()
> print test_df_strptime.collect()
> print test_df_todate.collect()
> print test_df_older_todate.collect()
> {code}
> {noformat}
> +---+--+--+
> |date_string|number| date_cast|
> +---+--+--+
> | 0002-01-01| 1|0002-01-03|
> | 1581-01-01| 2|1580-12-22|
> | 1582-01-01| 3|1581-12-22|
> | 1583-01-01| 4|1583-01-01|
> | 1584-01-01| 5|1584-01-01|
> | 2012-01-21| 6|2012-01-21|
> +---+--+--+
> +---+--+--+
> |date_string|number| date_cast|
> +---+--+--+
> | 0002-01-01| 1|0002-01-01|
> | 1581-01-01| 2|1581-01-01|
> | 1582-01-01| 3|1582-01-01|
> | 1583-01-01| 4|1583-01-01|
> | 1584-01-01| 5|1584-01-01|
> | 2012-01-21| 6|2012-01-21|
> +---+--+--+
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(2, 1, 1)), 
> Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 1, 
> 1)), Row(date_string=u'1582-01-01', number=3, date_cast=datetime.date(1582, 
> 1, 1)), Row(date_string=u'1583-01-01', number=4, 
> date_cast=datetime.date(1583, 1, 1)), Row(

Re: New Amazon AMIs for EC2 script

2017-02-23 Thread Nicholas Chammas
spark-ec2 has moved to GitHub and is no longer part of the Spark project. A
related issue from the current issue tracker that you may want to
follow/comment on is this one: https://github.com/amplab/spark-ec2/issues/74

As I said there, I think requiring custom AMIs is one of the major
maintenance headaches of spark-ec2. I solved this problem in my own
project, Flintrock, by working with
the default Amazon Linux AMIs and letting people more freely bring their
own AMI.

Nick


On Thu, Feb 23, 2017 at 7:23 AM in4maniac  wrote:

> Hi all,
>
> I have been using the EC2 script to launch R pyspark clusters for a while
> now. As we use a lot of packages such as numpy and scipy with openblas,
> scikit-learn, bokeh, vowpal wabbit, pystan, etc. All this time, we
> have
> been building AMIs on top of the standard spark-AMIs at
> https://github.com/amplab/spark-ec2/tree/branch-1.6/ami-list/us-east-1
>
> Mainly, I have done the following:
> - updated yum
> - Changed the standard python to python 2.7
> - changed pip to 2.7 and installed alot of libararies on top of the
> existing
> AMIs and created my own AMIs to avoid having to boostrap.
>
> But the EC2 standard AMIs are from *early February 2014* and now have
> become extremely fragile. For example, when I update a certain library,
> ipython would break, or pip would break and so forth.
>
> Can someone please direct me to a more up-to-date AMI that I can use with
> more confidence? And I am also interested to know what things need to be in
> the AMI, if I wanted to build an AMI from scratch (last resort :( )
>
> And isn't it time to have a ticket in the Spark project to build a new
> suite
> of AMIs for the EC2 script?
> https://issues.apache.org/jira/browse/SPARK-922
>
> Many thanks
> in4maniac
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/New-Amazon-AMIs-for-EC2-script-tp28419.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Will .count() always trigger an evaluation of each row?

2017-02-17 Thread Nicholas Chammas
Especially during development, people often use .count() or
.persist().count() to force evaluation of all rows — exposing any problems,
e.g. due to bad data — and to load data into cache to speed up subsequent
operations.

But as the optimizer gets smarter, I’m guessing it will eventually learn
that it doesn’t have to do all that work to give the correct count. (This
blog post suggests that something like this is already happening.) This will change
Spark’s practical behavior while technically preserving semantics.

What will people need to do then to force evaluation or caching?
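
For example, will patterns like the following (a rough sketch, assuming a
SparkSession named spark and a made-up DataFrame) remain reliable, or will we
need a dedicated API?

    # Force a full scan in a way the optimizer can't elide, and populate the cache.
    df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")
    df.persist()
    df.foreach(lambda _: None)   # touches every row
    # Or fall back to the RDD API, which always materializes the rows:
    df.rdd.count()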

Nick


[jira] [Commented] (SPARK-19553) Add GroupedData.countApprox()

2017-02-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870780#comment-15870780
 ] 

Nicholas Chammas commented on SPARK-19553:
--

The utility of 1) would be the ability to count items instead of distinct items, 
unless I misunderstood what you're saying. I would imagine that just counting 
items (as opposed to distinct items) would be cheaper, in addition to being 
semantically different.

I'll open a PR for 3), unless someone else wants to step in and do that.
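
To make the distinction concrete, a quick sketch (PySpark; the data and column 
names are made up for illustration):

{code}
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('ca', 1), ('ca', 1), ('ny', 2)], ['col1', 'col2'])

# Exact count of *items* per group (what this issue asks to approximate):
df.groupBy('col1').count().show()

# Approximate count of *distinct* items per group -- what approx_count_distinct()
# already gives us, and semantically a different thing:
df.groupBy('col1').agg(F.approx_count_distinct('col2')).show()
{code}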

> Add GroupedData.countApprox()
> -
>
> Key: SPARK-19553
> URL: https://issues.apache.org/jira/browse/SPARK-19553
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> We already have a 
> [{{pyspark.sql.functions.approx_count_distinct()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approx_count_distinct]
>  that can be applied to grouped data, but it seems odd that you can't just 
> get regular approximate count for grouped data.
> I imagine the API would mirror that for 
> [{{RDD.countApprox()}}|http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.countApprox],
>  but I'm not sure:
> {code}
> (df
> .groupBy('col1')
> .countApprox(timeout=300, confidence=0.95)
> .show())
> {code}
> Or, if we want to mirror the {{approx_count_distinct()}} function, we can do 
> that too. I'd want to understand why that function doesn't take a timeout or 
> confidence parameter, though. Also, what does {{rsd}} mean? It's not 
> documented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-15 Thread Nicholas Chammas
I don't think this is the right place for questions about Databricks. I'm
pretty sure they have their own website with a forum for questions about
their product.

Maybe this? https://forums.databricks.com/

On Wed, Feb 15, 2017 at 2:34 PM Sam Elamin  wrote:

> Hey folks
>
> This one is mainly aimed at the databricks folks, I have been trying to
> replicate the cloudtrail demo
>  Micheal did at Spark
> Summit. The code for it can be found here
> 
>
> My question is how did you get the results to be displayed and updated
> continuously in real time
>
> I am also using databricks to duplicate it but I noticed the code link
> mentions
>
>  "If you count the number of rows in the table, you should find the value
> increasing over time. Run the following every few minutes."
> This leads me to believe that the version of Databricks that Michael was
> using for the demo is still not released, or at least the functionality to
> display those changes in real time isn't.
>
> Is this the case? or am I completely wrong?
>
> Can I display the results of a structured streaming query in realtime
> using the databricks "display" function?
>
>
> Regards
> Sam
>

