Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
join and joinWith are just two different join semantics; this is not about Dataset vs DataFrame.

join is the relational join, where fields are flattened; joinWith is more
like a tuple join, where the output has two fields that are nested.

So you can do

Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]

DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]

Dataset[A] join Dataset[B] = Dataset[Row]

DataFrame[A] join DataFrame[B] = Dataset[Row]
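
To make the difference concrete, here is a minimal, self-contained sketch against the Spark 2.x API as it eventually shipped (the case classes, column names, and data below are made up for illustration):

import org.apache.spark.sql.{Dataset, SparkSession}

// Toy record types used only for this example.
case class A(id: Long, name: String)
case class B(id: Long, score: Int)

object JoinVsJoinWith {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-vs-joinWith").getOrCreate()
    import spark.implicits._

    val as = Seq(A(1L, "x"), A(2L, "y")).toDS()
    val bs = Seq(B(1L, 10), B(3L, 30)).toDS()

    // joinWith is a "tuple join": each output row nests one A and one B.
    val nested: Dataset[(A, B)] = as.joinWith(bs, as("id") === bs("id"))

    // join is the relational join: the fields of both sides are flattened
    // into a single untyped Row, i.e. a DataFrame (Dataset[Row]).
    val flat = as.join(bs, "id")

    nested.printSchema() // two struct columns: _1 and _2
    flat.printSchema()   // flat columns: id, name, score

    spark.stop()
  }
}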



On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui  wrote:

> Vote for option 2.
>
> Source compatibility and binary compatibility are very important from
> user’s perspective.
>
> It ‘s unfair for Java developers that they don’t have DataFrame
> abstraction. As you said, sometimes it is more natural to think about
> DataFrame.
>
>
>
> I am wondering if conceptually there is slight subtle difference between
> DataFrame and Dataset[Row]? For example,
>
> Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
>
> So,
>
> Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]
>
>
>
> While
>
> DataFrame join DataFrame is still DataFrame of Row?
>
>
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Friday, February 26, 2016 8:52 AM
> *To:* Koert Kuipers 
> *Cc:* dev@spark.apache.org
> *Subject:* Re: [discuss] DataFrame vs Dataset in Spark 2.0
>
>
>
> Yes - and that's why source compatibility is broken.
>
>
>
> Note that it is not just a "convenience" thing. Conceptually DataFrame is
> a Dataset[Row], and for some developers it is more natural to think about
> "DataFrame" rather than "Dataset[Row]".
>
>
>
> If we were in C++, DataFrame would've been a type alias for Dataset[Row]
> too, and some methods would return DataFrame (e.g. sql method).
>
>
>
>
>
>
>
> On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers  wrote:
>
> since a type alias is purely a convenience thing for the scala compiler,
> does option 1 mean that the concept of DataFrame ceases to exist from a
> java perspective, and they will have to refer to Dataset?
>
>
>
> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin  wrote:
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
>
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
>
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
>
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
>
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
>
> + A lot less code
>
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
>
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
>
> The pros/cons are basically the inverse of Option 1.
>
>
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
>
> - A lot more code (1000+ loc)
>
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
>
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>
>
>
>
>
>
>


RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Sun, Rui
Vote for option 2.
Source compatibility and binary compatibility are very important from the user's perspective.
It's unfair for Java developers that they don't have the DataFrame abstraction. As you said, sometimes it is more natural to think about DataFrame.

I am wondering if, conceptually, there is a subtle difference between DataFrame and Dataset[Row]. For example:
Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
So,
Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]

While
DataFrame join DataFrame is still a DataFrame of Row?

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, February 26, 2016 8:52 AM
To: Koert Kuipers 
Cc: dev@spark.apache.org
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a 
Dataset[Row], and for some developers it is more natural to think about 
"DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row] too, 
and some methods would return DataFrame (e.g. sql method).



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers 
> wrote:
since a type alias is purely a convenience thing for the scala compiler, does 
option 1 mean that the concept of DataFrame ceases to exist from a java 
perspective, and they will have to refer to Dataset?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin 
> wrote:
When we first introduced Dataset in 1.6 as an experimental API, we wanted to 
merge Dataset/DataFrame but couldn't because we didn't want to break the 
pre-existing DataFrame API (e.g. map function should return Dataset, rather 
than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and 
Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways 
to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. DataFrame as a concrete class that extends Dataset[Row]


I'm wondering what you think about this. The pros and cons I can think of are:


Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what libraries or applications need to do, and we won't see type mismatches (e.g. a function expects DataFrame, but the user is passing in Dataset[Row])
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary 
compatibility for Scala/Java


Option 2. DataFrame as a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in 
Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less clean, and can be confusing when users pass a Dataset[Row] into a function that expects a DataFrame


The concerns are mostly with Scala/Java. For Python, it is very easy to 
maintain source compatibility for both (there is no concept of binary 
compatibility), and for R, we are only supporting the DataFrame operations anyway because that's a more familiar interface for R users outside of Spark.






Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a
Dataset[Row], and for some developers it is more natural to think about
"DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row]
too, and some methods would return DataFrame (e.g. sql method).



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers  wrote:

> since a type alias is purely a convenience thing for the scala compiler,
> does option 1 mean that the concept of DataFrame ceases to exist from a
> java perspective, and they will have to refer to Dataset?
>
> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin  wrote:
>
>> When we first introduced Dataset in 1.6 as an experimental API, we wanted
>> to merge Dataset/DataFrame but couldn't because we didn't want to break the
>> pre-existing DataFrame API (e.g. map function should return Dataset, rather
>> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
>> and Dataset.
>>
>> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are
>> two ways to implement this:
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>>
>>
>> I'm wondering what you think about this. The pros and cons I can think of
>> are:
>>
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>> + Cleaner conceptually, especially in Scala. It will be very clear what
>> libraries or applications need to do, and we won't see type mismatches
>> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
>> + A lot less code
>> - Breaks source compatibility for the DataFrame API in Java, and binary
>> compatibility for Scala/Java
>>
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>>
>> The pros/cons are basically the inverse of Option 1.
>>
>> + In most cases, can maintain source compatibility for the DataFrame API
>> in Java, and binary compatibility for Scala/Java
>> - A lot more code (1000+ loc)
>> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
>> into a function that expects a DataFrame
>>
>>
>> The concerns are mostly with Scala/Java. For Python, it is very easy to
>> maintain source compatibility for both (there is no concept of binary
>> compatibility), and for R, we are only supporting the DataFrame operations
>> anyway because that's more familiar interface for R users outside of Spark.
>>
>>
>>
>


Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Koert Kuipers
since a type alias is purely a convenience thing for the scala compiler,
does option 1 mean that the concept of DataFrame ceases to exist from a
java perspective, and they will have to refer to Dataset?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin  wrote:

> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>


Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
It might make sense, but this option seems to carry all the cons of Option
2, and yet doesn't provide compatibility for Java?

On Thu, Feb 25, 2016 at 3:31 PM, Michael Malak 
wrote:

> Would it make sense (in terms of feasibility, code organization, and
> politically) to have a JavaDataFrame, as a way to isolate the 1000+ extra
> lines to a Java compatibility layer/class?
>
>
> --
> *From:* Reynold Xin 
> *To:* "dev@spark.apache.org" 
> *Sent:* Thursday, February 25, 2016 4:23 PM
> *Subject:* [discuss] DataFrame vs Dataset in Spark 2.0
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>
>
>


Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Michael Malak
Would it make sense (in terms of feasibility, code organization, and politics) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines into a Java compatibility layer/class?
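
Just to make the idea concrete, here is a purely hypothetical sketch (JavaDataFrame is not a real Spark class; it only shows how a Java-facing layer could delegate to Dataset[Row], in the spirit of JavaRDD wrapping RDD):

import org.apache.spark.sql.{Column, Dataset, Row}

// Hypothetical compatibility wrapper, not part of Spark: forwards a few
// representative operations to the underlying Dataset[Row] and re-wraps the result.
class JavaDataFrame(val ds: Dataset[Row]) {
  def select(col: String, cols: String*): JavaDataFrame = new JavaDataFrame(ds.select(col, cols: _*))
  def filter(condition: Column): JavaDataFrame = new JavaDataFrame(ds.filter(condition))
  def count(): Long = ds.count()
}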

  From: Reynold Xin 
 To: "dev@spark.apache.org"  
 Sent: Thursday, February 25, 2016 4:23 PM
 Subject: [discuss] DataFrame vs Dataset in Spark 2.0
   
When we first introduced Dataset in 1.6 as an experimental API, we wanted to 
merge Dataset/DataFrame but couldn't because we didn't want to break the 
pre-existing DataFrame API (e.g. map function should return Dataset, rather 
than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and 
Dataset.
Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways 
to implement this:
Option 1. Make DataFrame a type alias for Dataset[Row]
Option 2. DataFrame as a concrete class that extends Dataset[Row]

I'm wondering what you think about this. The pros and cons I can think of are:

Option 1. Make DataFrame a type alias for Dataset[Row]
+ Cleaner conceptually, especially in Scala. It will be very clear what libraries or applications need to do, and we won't see type mismatches (e.g. a function expects DataFrame, but the user is passing in Dataset[Row])
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java

Option 2. DataFrame as a concrete class that extends Dataset[Row]
The pros/cons are basically the inverse of Option 1.
+ In most cases, can maintain source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less clean, and can be confusing when users pass a Dataset[Row] into a function that expects a DataFrame

The concerns are mostly with Scala/Java. For Python, it is very easy to 
maintain source compatibility for both (there is no concept of binary 
compatibility), and for R, we are only supporting the DataFrame operations anyway because that's a more familiar interface for R users outside of Spark.



  

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Chester Chen
Vote for Option 1.
  1) Since 2.0 is a major API revision, we are expecting some API changes.
  2) It helps long-term code base maintenance, with short-term pain on the Java side.
  3) Not quite sure how large the code base using the Java DataFrame APIs is.





On Thu, Feb 25, 2016 at 3:23 PM, Reynold Xin  wrote:

> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>


[discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
When we first introduced Dataset in 1.6 as an experimental API, we wanted
to merge Dataset/DataFrame but couldn't because we didn't want to break the
pre-existing DataFrame API (e.g. map function should return Dataset, rather
than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
and Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
ways to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. DataFrame as a concrete class that extends Dataset[Row]


I'm wondering what you think about this. The pros and cons I can think of
are:


Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what libraries or applications need to do, and we won't see type mismatches (e.g. a function expects DataFrame, but the user is passing in Dataset[Row])
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary
compatibility for Scala/Java


Option 2. DataFrame as a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in
Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less clean, and can be confusing when users pass a Dataset[Row] into a function that expects a DataFrame (see the sketch below)
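
A minimal, self-contained toy sketch of the two shapes; Row and Dataset here are stand-ins defined in the snippet itself, not Spark's real classes, and the Option 2 class is named DataFrameClass only so both variants can live in one file:

object AliasVsSubclass {
  final case class Row(values: Seq[Any])

  class Dataset[T](val data: Seq[T]) {
    def map[U](f: T => U): Dataset[U] = new Dataset(data.map(f))
  }

  // Option 1: DataFrame is just another name for Dataset[Row].
  type DataFrame = Dataset[Row]

  // Option 2: DataFrame is a distinct concrete class extending Dataset[Row].
  class DataFrameClass(data: Seq[Row]) extends Dataset[Row](data)

  def needsDataFrame(df: DataFrame): Int = df.data.size

  def main(args: Array[String]): Unit = {
    val ds: Dataset[Row] = new Dataset(Seq(Row(Seq(1, "a"))))
    // Under Option 1 this compiles, because DataFrame *is* Dataset[Row].
    println(needsDataFrame(ds))
    // Under Option 2, a function taking the concrete DataFrame subclass would
    // reject a plain Dataset[Row], which is the mismatch noted above.
  }
}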


The concerns are mostly with Scala/Java. For Python, it is very easy to
maintain source compatibility for both (there is no concept of binary
compatibility), and for R, we are only supporting the DataFrame operations anyway because that's a more familiar interface for R users outside of Spark.


Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Łukasz Gieroń
Thank you, your version of the mvn invocation (as opposed to my bare "mvn eclipse:eclipse") worked perfectly.

On Thu, Feb 25, 2016 at 3:22 PM, Yin Yang  wrote:

> In yarn/.classpath , I see:
>   
>
> Here is the command I used:
>
> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.0 package -DskipTests eclipse:eclipse
>
> FYI
>
> On Thu, Feb 25, 2016 at 6:13 AM, Łukasz Gieroń  wrote:
>
>> I've just checked, and "mvn eclipse:eclipse" generates incorrect projects
>> as well.
>>
>>
>> On Thu, Feb 25, 2016 at 3:04 PM, Allen Zhang 
>> wrote:
>>
>>> why not use maven
>>>
>>>
>>>
>>>
>>>
>>>
>>> At 2016-02-25 21:55:49, "lgieron"  wrote:
>>> >The Spark projects generated by sbt eclipse plugin have incorrect dependent
>>> >projects (as visible on Properties -> Java Build Path -> Projects tab). All
>>> >dependent project are missing the "_2.11" suffix (for example, it's
>>> >"spark-core" instead of correct "spark-core_2.11"). This of course causes
>>> >the build to fail.
>>> >
>>> >I am using sbteclipse-plugin version 4.0.0.
>>> >
>>> >Has anyone encountered this problem and found a fix?
>>> >
>>> >Thanks,
>>> >Lukasz
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >--
>>> >View this message in context: 
>>> >http://apache-spark-developers-list.1001551.n3.nabble.com/Eclipse-Wrong-project-dependencies-in-generated-by-sbt-eclipse-tp16436.html
>>> >Sent from the Apache Spark Developers List mailing list archive at 
>>> >Nabble.com.
>>> >
>>> >-
>>> >To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> >For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>>
>>>
>>>
>>>
>>>
>>
>>
>


Re: [build system] additional jenkins downtime next thursday

2016-02-25 Thread shane knapp
alright, the update is done and worker-08 rebooted.  we're back up and
building already!

On Thu, Feb 25, 2016 at 8:15 AM, shane knapp  wrote:
> this is happening now.
>
> On Wed, Feb 24, 2016 at 6:08 PM, shane knapp  wrote:
>> the security update has been released, and it's a doozy!
>>
>> https://wiki.jenkins-ci.org/display/SECURITY/Security+Advisory+2016-02-24
>>
>> i will be putting jenkins in to quiet mode ~7am PST tomorrow morning
>> for the upgrade, and expect to be back up and building by 9am PST at
>> the latest.
>>
>> amp-jenkins-worker-08 will also be getting a reboot to test out a fix for:
>> https://github.com/apache/spark/pull/9893
>>
>> shane
>>
>> On Wed, Feb 17, 2016 at 10:47 AM, shane knapp  wrote:
>>> the security release has been delayed until next wednesday morning,
>>> and i'll be doing the upgrade first thing thursday morning.
>>>
>>> i'll update everyone when i get more information.
>>>
>>> thanks!
>>>
>>> shane
>>>
>>> On Thu, Feb 11, 2016 at 10:19 AM, shane knapp  wrote:
 there's a big security patch coming out next week, and i'd like to
 upgrade our jenkins installation so that we're covered.  it'll be
 around 8am, again, and i'll send out more details about the upgrade
 when i get them.

 thanks!

 shane




Re: [build system] additional jenkins downtime next thursday

2016-02-25 Thread shane knapp
this is happening now.

On Wed, Feb 24, 2016 at 6:08 PM, shane knapp  wrote:
> the security update has been released, and it's a doozy!
>
> https://wiki.jenkins-ci.org/display/SECURITY/Security+Advisory+2016-02-24
>
> i will be putting jenkins in to quiet mode ~7am PST tomorrow morning
> for the upgrade, and expect to be back up and building by 9am PST at
> the latest.
>
> amp-jenkins-worker-08 will also be getting a reboot to test out a fix for:
> https://github.com/apache/spark/pull/9893
>
> shane
>
> On Wed, Feb 17, 2016 at 10:47 AM, shane knapp  wrote:
>> the security release has been delayed until next wednesday morning,
>> and i'll be doing the upgrade first thing thursday morning.
>>
>> i'll update everyone when i get more information.
>>
>> thanks!
>>
>> shane
>>
>> On Thu, Feb 11, 2016 at 10:19 AM, shane knapp  wrote:
>>> there's a big security patch coming out next week, and i'd like to
>>> upgrade our jenkins installation so that we're covered.  it'll be
>>> around 8am, again, and i'll send out more details about the upgrade
>>> when i get them.
>>>
>>> thanks!
>>>
>>> shane




Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Yin Yang
In yarn/.classpath , I see:
  

Here is the command I used:

build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
-Dhadoop.version=2.7.0 package -DskipTests eclipse:eclipse

FYI

On Thu, Feb 25, 2016 at 6:13 AM, Łukasz Gieroń  wrote:

> I've just checked, and "mvn eclipse:eclipse" generates incorrect projects
> as well.
>
>
> On Thu, Feb 25, 2016 at 3:04 PM, Allen Zhang 
> wrote:
>
>> why not use maven
>>
>>
>>
>>
>>
>>
>> At 2016-02-25 21:55:49, "lgieron"  wrote:
>> >The Spark projects generated by sbt eclipse plugin have incorrect dependent
>> >projects (as visible on Properties -> Java Build Path -> Projects tab). All
>> >dependent project are missing the "_2.11" suffix (for example, it's
>> >"spark-core" instead of correct "spark-core_2.11"). This of course causes
>> >the build to fail.
>> >
>> >I am using sbteclipse-plugin version 4.0.0.
>> >
>> >Has anyone encountered this problem and found a fix?
>> >
>> >Thanks,
>> >Lukasz
>> >
>> >
>> >
>> >
>> >
>> >--
>> >View this message in context: 
>> >http://apache-spark-developers-list.1001551.n3.nabble.com/Eclipse-Wrong-project-dependencies-in-generated-by-sbt-eclipse-tp16436.html
>> >Sent from the Apache Spark Developers List mailing list archive at 
>> >Nabble.com.
>> >
>> >-
>> >To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>>
>>
>>
>>
>
>


Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Allen Zhang

well, I am using IDEA to import the code base.





At 2016-02-25 22:13:11, "Łukasz Gieroń"  wrote:

I've just checked, and "mvn eclipse:eclipse" generates incorrect projects as 
well.



On Thu, Feb 25, 2016 at 3:04 PM, Allen Zhang  wrote:

why not use maven








At 2016-02-25 21:55:49, "lgieron"  wrote:
>The Spark projects generated by sbt eclipse plugin have incorrect dependent
>projects (as visible on Properties -> Java Build Path -> Projects tab). All
>dependent project are missing the "_2.11" suffix (for example, it's
>"spark-core" instead of correct "spark-core_2.11"). This of course causes
>the build to fail.
>
>I am using sbteclipse-plugin version 4.0.0.
>
>Has anyone encountered this problem and found a fix?
>
>Thanks,
>Lukasz
>
>
>
>
>
>--
>View this message in context: 
>http://apache-spark-developers-list.1001551.n3.nabble.com/Eclipse-Wrong-project-dependencies-in-generated-by-sbt-eclipse-tp16436.html
>Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>For additional commands, e-mail: dev-h...@spark.apache.org
>





 




Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Łukasz Gieroń
I've just checked, and "mvn eclipse:eclipse" generates incorrect projects
as well.

On Thu, Feb 25, 2016 at 3:04 PM, Allen Zhang  wrote:

> why not use maven
>
>
>
>
>
>
> At 2016-02-25 21:55:49, "lgieron"  wrote:
> >The Spark projects generated by sbt eclipse plugin have incorrect dependent
> >projects (as visible on Properties -> Java Build Path -> Projects tab). All
> >dependent project are missing the "_2.11" suffix (for example, it's
> >"spark-core" instead of correct "spark-core_2.11"). This of course causes
> >the build to fail.
> >
> >I am using sbteclipse-plugin version 4.0.0.
> >
> >Has anyone encountered this problem and found a fix?
> >
> >Thanks,
> >Lukasz
> >
> >
> >
> >
> >
> >--
> >View this message in context: 
> >http://apache-spark-developers-list.1001551.n3.nabble.com/Eclipse-Wrong-project-dependencies-in-generated-by-sbt-eclipse-tp16436.html
> >Sent from the Apache Spark Developers List mailing list archive at 
> >Nabble.com.
> >
> >-
> >To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>
>
>
>


Re:Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Allen Zhang
dev/change-scala-version 2.10 may help you?








At 2016-02-25 21:55:49, "lgieron"  wrote:
>The Spark projects generated by sbt eclipse plugin have incorrect dependent
>projects (as visible on Properties -> Java Build Path -> Projects tab). All
>dependent project are missing the "_2.11" suffix (for example, it's
>"spark-core" instead of correct "spark-core_2.11"). This of course causes
>the build to fail.
>
>I am using sbteclipse-plugin version 4.0.0.
>
>Has anyone encountered this problem and found a fix?
>
>Thanks,
>Lukasz
>
>
>
>
>
>--
>View this message in context: 
>http://apache-spark-developers-list.1001551.n3.nabble.com/Eclipse-Wrong-project-dependencies-in-generated-by-sbt-eclipse-tp16436.html
>Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>For additional commands, e-mail: dev-h...@spark.apache.org
>


Re:Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Allen Zhang
why not use maven








At 2016-02-25 21:55:49, "lgieron"  wrote:
>The Spark projects generated by sbt eclipse plugin have incorrect dependent
>projects (as visible on Properties -> Java Build Path -> Projects tab). All
>dependent project are missing the "_2.11" suffix (for example, it's
>"spark-core" instead of correct "spark-core_2.11"). This of course causes
>the build to fail.
>
>I am using sbteclipse-plugin version 4.0.0.
>
>Has anyone encountered this problem and found a fix?
>
>Thanks,
>Lukasz
>
>
>
>
>
>--
>View this message in context: 
>http://apache-spark-developers-list.1001551.n3.nabble.com/Eclipse-Wrong-project-dependencies-in-generated-by-sbt-eclipse-tp16436.html
>Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>For additional commands, e-mail: dev-h...@spark.apache.org
>


Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread lgieron
The Spark projects generated by the sbt eclipse plugin have incorrect dependent projects (as visible on Properties -> Java Build Path -> Projects tab). All dependent projects are missing the "_2.11" suffix (for example, it's "spark-core" instead of the correct "spark-core_2.11"). This of course causes the build to fail.

I am using sbteclipse-plugin version 4.0.0.

Has anyone encountered this problem and found a fix?

Thanks,
Lukasz









Bug in DiskBlockManager subDirs logic?

2016-02-25 Thread Zee Chen
Hi,

I am debugging a situation where SortShuffleWriter sometimes fails to create a file, with the following stack trace:

16/02/23 11:48:46 ERROR Executor: Exception in task 13.0 in stage
47827.0 (TID 1367089)
java.io.FileNotFoundException:
/tmp/spark-9dd8dca9-6803-4c6c-bb6a-0e9c0111837c/executor-129dfdb8-9422-4668-989e-e789703526ad/blockmgr-dda6e340-7859-468f-b493-04e4162d341a/00/temp_shuffle_69fe1673-9ff2-462b-92b8-683d04669aad
(No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


I checked the Linux file system (ext4) and saw the /00/ subdir was missing. I went through a heap dump of the CoarseGrainedExecutorBackend JVM process and found that DiskBlockManager's subDirs list had more non-null 2-hex subdirs than were present on the file system! As a test, I created all 64 2-hex subdirs by hand, and the problem went away.

So, has anybody else seen this problem? Looking at the relevant logic in DiskBlockManager, it hasn't changed much since the fix for https://issues.apache.org/jira/browse/SPARK-6468
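
For reference, here is a rough, self-contained sketch of that subdirectory selection, paraphrased from memory of DiskBlockManager.getFile and Utils.nonNegativeHash in the 1.5 line (the real code also creates the subdirectory on demand, so the details may differ):

object SubDirSketch {
  // spark.diskStore.subDirectories defaults to 64.
  val subDirsPerLocalDir = 64

  // Rough equivalent of Utils.nonNegativeHash.
  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h != Int.MinValue) math.abs(h) else 0
  }

  // Maps a block file name to a (local dir index, sub dir name) pair; the
  // sub dir on disk is the index rendered as a two-digit hex name such as "00".
  def locate(filename: String, numLocalDirs: Int): (Int, String) = {
    val hash = nonNegativeHash(filename)
    val dirId = hash % numLocalDirs
    val subDirId = (hash / numLocalDirs) % subDirsPerLocalDir
    (dirId, "%02x".format(subDirId))
  }

  def main(args: Array[String]): Unit = {
    println(locate("temp_shuffle_69fe1673-9ff2-462b-92b8-683d04669aad", 1))
  }
}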

My configuration:
spark-1.5.1, hadoop-2.6.0, standalone, oracle jdk8u60

Thanks,
Zee
