Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-13 Thread Shivaram Venkataraman
The R artifacts have some issues that Felix and I are debugging. Let's not
block the announcement on that.

Thanks
Shivaram

On Wed, Dec 13, 2017 at 5:59 AM, Sean Owen wrote:

> Looks like the Maven artifacts are up and the site's up -- what about the
> Python and R artifacts?
> I can also move the spark.apache/docs/latest link to point to 2.2.1 if
> it's pretty much ready.
> We should then announce the release officially, too.
>
> On Wed, Dec 6, 2017 at 5:00 PM Felix Cheung wrote:
>
>> I saw the svn move on Monday, so I’m working on the website updates.
>>
>> I will look into Maven today, and will ask for help if I can’t do it myself.
>>
>>
>> On Wed, Dec 6, 2017 at 10:49 AM Sean Owen wrote:
>>
>>> Pardon, did this release finish? I don't see it in Maven. I know there
>>> was some question about getting a hand in finishing the release process,
>>> including copying artifacts in svn. Was there anything else you're waiting
>>> on someone to do?
>>>
>>>
>>> On Fri, Dec 1, 2017 at 2:10 AM Felix Cheung wrote:
>>>
 This vote passes. Thanks everyone for testing this release.


 +1:

 Sean Owen (binding)

 Herman van Hövell tot Westerflier (binding)

 Wenchen Fan (binding)

 Shivaram Venkataraman (binding)

 Felix Cheung

 Henry Robinson

 Hyukjin Kwon

 Dongjoon Hyun

 Kazuaki Ishizaki

 Holden Karau

 Weichen Xu


 0: None

 -1: None

>>>


Re: Leveraging S3 select

2017-12-13 Thread Steve Loughran


On 8 Dec 2017, at 17:05, Andrew Duffy wrote:

Hey Steve,

Happen to have a link to the TPC-DS benchmark data w/ random S3 reads? I've
done a decent amount of digging, but all I've found is a reference in a slide
deck.

Is that one of mine?

We haven't done any benchmarking with/without random IO for a while, as we've
taken that as a given and have been worrying about the other aspects of the
problem: speeding up directory listings & getFileStatus calls (used a lot in
the serialized partitioning phase), and making direct commits of work to S3
both correct and performant.

I'm trying to sort out some benchmarking there, which involves cherry-picking
the new changes to an internal release, building that, and having someone who
understands benchmarking set up a cluster and run the tests; that costs their
time and the price of the clusters. I say clusters, as it'll inevitably
involve playing with different VM options and some EMR clusters alongside (*).

One bit of fun there is that different instances of the same cluster specs may
give different numbers; it depends on the actual CPUs allocated, network, and
neighbours. When we do publish numbers, we do it from a single cluster
instance, rather than taking the best per-test outcome across multiple
clusters. Good to check whether others do the same.

Otherwise: test with your own code & the Hadoop 2.8.1+ JARs and see what
numbers you get. If you are using Parquet or ORC, I would not consider using
the sequential IO code. Conversely, if you are working with CSV, Avro, or
gzip, you don't want random IO, because what would be a single file GET with
some forward skips (read & discard of data) becomes a slow sequence of GETs
with latency between each one.
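
For anyone who wants to try this from Spark, here is a minimal sketch,
assuming the Hadoop 2.8.1+ s3a JARs are on the classpath (the app name and
bucket path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-random-io")
  // Hadoop 2.8+ S3A option: "sequential" (the default) or "random".
  // "random" suits columnar formats (Parquet/ORC); stick with
  // "sequential" for full-file scans of CSV/Avro/gzip.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

// Column chunk reads now issue ranged GETs rather than breaking
// a long sequential GET on every backwards seek.
val df = spark.read.parquet("s3a://your-bucket/path/to/table")
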
HADOOP-14965 (not yet committed) changes the default policy of an input stream
to "switch to random IO mode on the first backwards seek", so you don't need
to decide upfront which to use. There's the potential cost of the first HTTPS
abort on that initial backwards seek, but after that it's random IO all the
way. The WASB client has been doing this for a while and everyone is happy,
not least because it's one less tuning option to document & test, and it
eliminates a whole class of support calls ("client is fast to read .csv but
not .orc files").

-Steve


(*) I have a Hadoop branch-2.9 fork with the new committer stuff in, if
someone wants to compare numbers there. Bear in mind that the current
RDD.write(s3a://something) command, when it uses the Hadoop FileOutputFormats
and hence the FileOutputCommitter, is not just observably a slow O(data) kind
of operation, it is *not correct*, so the performance is just a detail. It's
the one you notice, but not the issue to fear. Fixed by HADOOP-13786 & a bit
of glue to keep Spark happy.


Hinge Gradient

2017-12-13 Thread Debasish Das
Hi,

I looked into the LinearSVC flow and found the gradient for hinge as
follows:

Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x))).
Therefore the (sub)gradient is -(2y - 1)*x when 1 - (2y - 1)(f_w(x)) > 0, and
0 otherwise.

max is a non-smooth function.

Did we try using a ReLU/softmax-style function to smooth the hinge loss?

The loss function would change to SoftMax(0, 1 - (2y - 1)(f_w(x))), i.e.
log(1 + exp(1 - (2y - 1) f_w(x))).

Since this function is smooth, the gradient will be well defined and
LBFGS/OWLQN should behave well.
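
For concreteness, here is a standalone sketch of the idea (my own notation and
helper names, not the actual LinearSVC aggregator code):

object SmoothedHinge {

  // margin m = (2y - 1) * (w . x), with labels y in {0, 1}
  private def margin(w: Array[Double], x: Array[Double], y: Double): Double =
    (2 * y - 1) * w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // numerically stable softplus(z) = log(1 + exp(z))
  private def softplus(z: Double): Double =
    if (z > 0) z + math.log1p(math.exp(-z)) else math.log1p(math.exp(z))

  // standard hinge: max(0, 1 - m); subgradient -(2y - 1) * x when m < 1
  def hingeLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    math.max(0.0, 1.0 - margin(w, x, y))

  // smoothed hinge: log(1 + exp(1 - m))
  def smoothLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    softplus(1.0 - margin(w, x, y))

  // gradient: -sigmoid(1 - m) * (2y - 1) * x, which approaches the hinge
  // gradient -(2y - 1) * x as the margin violation grows
  def smoothGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val s = 1.0 / (1.0 + math.exp(-(1.0 - margin(w, x, y))))
    x.map(xi => -s * (2 * y - 1) * xi)
  }
}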

Please let me know if this has been tried already. If not, I can run some
benchmarks.

We already have softmax in multinomial regression, and it could be reused in
the LinearSVC flow.

Thanks.
Deb


Re: Decimals

2017-12-13 Thread Reynold Xin
Responses inline

On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido wrote:

> Hi all,
>
> I have seen a lot of problems related to decimal values in recent weeks
> (SPARK-22036 and SPARK-22755, for instance). Some are related to historical
> choices which I don't know about, so please excuse me if I am saying dumb
> things:
>
>  - why are we interpreting literal constants in queries as Decimal and not
> as Double? I think it is very unlikely that a user would enter a number that
> is beyond Double precision.
>

Probably just to be consistent with some popular databases.
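
For what it's worth, the typing is easy to check from spark-shell; a quick
sketch (the schema rendering below is from a 2.2.x build and may differ in
detail):

scala> spark.sql("SELECT 1.5 AS x").schema
res0: org.apache.spark.sql.types.StructType =
  StructType(StructField(x,DecimalType(2,1),false))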



>  - why are we returning null in case of precision loss? Is this approach
> better than just giving a result which might lose some accuracy?
>

The contract with decimal is that it should never lose precision (it was
created for financial reports, accounting, etc.). Returning null at least
tells the user that the data type can no longer support the precision
required.
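
A quick illustration from spark-shell (behaviour as of 2.2.x; the 39-digit
value is just an arbitrary number that cannot fit in DECIMAL(38,0)):

scala> spark.sql("SELECT CAST('123456789012345678901234567890123456789' AS DECIMAL(38,0)) AS d").show()
+----+
|   d|
+----+
|null|
+----+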



>
> Thanks,
> Marco
>