Re: [discuss] DataFrame vs Dataset in Spark 2.0
The join and joinWith are just two different join semantics; this is not about Dataset vs DataFrame. join is the relational join, where fields are flattened; joinWith is more like a tuple join, where the output has two fields that are nested. So you can do:

Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]
DataFrame joinWith DataFrame = Dataset[(Row, Row)]
Dataset[A] join Dataset[B] = Dataset[Row]
DataFrame join DataFrame = Dataset[Row]

On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui wrote:
> Vote for option 2.
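The difference in output shape can be sketched outside Spark with plain Scala collections. This is only a toy illustration of the semantics; A, B, and the joinWith/join functions below are stand-ins, not Spark's API:

```scala
// Toy element types standing in for the contents of two Datasets.
case class A(id: Int, x: String)
case class B(id: Int, y: String)

// joinWith-style: the output keeps both sides nested, like Dataset[(A, B)].
def joinWith(as: Seq[A], bs: Seq[B]): Seq[(A, B)] =
  for (a <- as; b <- bs if a.id == b.id) yield (a, b)

// join-style: the output flattens both sides into one row, like Dataset[Row].
def join(as: Seq[A], bs: Seq[B]): Seq[(Int, String, String)] =
  for (a <- as; b <- bs if a.id == b.id) yield (a.id, a.x, b.y)

val as = Seq(A(1, "a1"), A(2, "a2"))
val bs = Seq(B(1, "b1"))

println(joinWith(as, bs)) // two nested fields per result element
println(join(as, bs))     // one flat row per result element
```

The point is purely the result type: joinWith preserves each side as a whole value, while join loses the grouping into two sides.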
RE: [discuss] DataFrame vs Dataset in Spark 2.0
Vote for option 2. Source compatibility and binary compatibility are very important from the user's perspective. It's unfair for Java developers that they don't have the DataFrame abstraction. As you said, sometimes it is more natural to think about DataFrame.

I am wondering if conceptually there is a slight, subtle difference between DataFrame and Dataset[Row]? For example:

Dataset[T] joinWith Dataset[U] produces Dataset[(T, U)]

So,

Dataset[Row] joinWith Dataset[Row] produces Dataset[(Row, Row)]

while DataFrame join DataFrame is still a DataFrame of Row?

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, February 26, 2016 8:52 AM
To: Koert Kuipers
Cc: dev@spark.apache.org
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

Yes - and that's why source compatibility is broken.
Re: [discuss] DataFrame vs Dataset in Spark 2.0
Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a Dataset[Row], and for some developers it is more natural to think about "DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row] too, and some methods would return DataFrame (e.g. the sql method).

On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers wrote:
> since a type alias is purely a convenience thing for the scala compiler,
> does option 1 mean that the concept of DataFrame ceases to exist from a
> java perspective, and they will have to refer to Dataset?
Re: [discuss] DataFrame vs Dataset in Spark 2.0
since a type alias is purely a convenience thing for the scala compiler, does option 1 mean that the concept of DataFrame ceases to exist from a java perspective, and they will have to refer to Dataset?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin wrote:
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
Re: [discuss] DataFrame vs Dataset in Spark 2.0
It might make sense, but this option seems to carry all the cons of Option 2, and yet doesn't provide compatibility for Java?

On Thu, Feb 25, 2016 at 3:31 PM, Michael Malak wrote:
> Would it make sense (in terms of feasibility, code organization, and
> politically) to have a JavaDataFrame, as a way to isolate the 1000+ extra
> lines to a Java compatibility layer/class?
Re: [discuss] DataFrame vs Dataset in Spark 2.0
Would it make sense (in terms of feasibility, code organization, and politically) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines to a Java compatibility layer/class?

From: Reynold Xin
To: "dev@spark.apache.org"
Sent: Thursday, February 25, 2016 4:23 PM
Subject: [discuss] DataFrame vs Dataset in Spark 2.0

When we first introduced Dataset in 1.6 as an experimental API, we wanted to merge Dataset/DataFrame but couldn't because we didn't want to break the pre-existing DataFrame API (e.g. map function should return Dataset, rather than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and Dataset.
Re: [discuss] DataFrame vs Dataset in Spark 2.0
Vote for Option 1.

1) Since 2.0 is a major release, we are expecting some API changes.
2) It helps long-term code base maintenance, with short-term pain on the Java side.
3) Not quite sure how large the code base is that uses the Java DataFrame APIs.

On Thu, Feb 25, 2016 at 3:23 PM, Reynold Xin wrote:
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
[discuss] DataFrame vs Dataset in Spark 2.0
When we first introduced Dataset in 1.6 as an experimental API, we wanted to merge Dataset/DataFrame but couldn't because we didn't want to break the pre-existing DataFrame API (e.g. the map function should return Dataset, rather than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. Make DataFrame a concrete class that extends Dataset[Row]

I'm wondering what you think about this. The pros and cons I can think of are:

Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what libraries or applications need to do, and we won't see type mismatches (e.g. a function expects DataFrame, but the user is passing in Dataset[Row])
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java

Option 2. Make DataFrame a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java
- A lot more code (1000+ lines)
- Less clean, and can be confusing when users pass a Dataset[Row] into a function that expects a DataFrame

The concerns are mostly with Scala/Java. For Python, it is very easy to maintain source compatibility for both (there is no concept of binary compatibility), and for R, we are only supporting the DataFrame operations anyway because that's a more familiar interface for R users outside of Spark.
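A minimal sketch of the two options in plain Scala. The Dataset and Row classes here are toy stand-ins for Spark's, just to show the type-checking consequences of alias vs subclass:

```scala
// Toy stand-ins for Spark's classes.
class Row(val values: Seq[Any])
class Dataset[T](val data: Seq[T])

// Option 1: DataFrame is a type alias. Any Dataset[Row] *is* a DataFrame,
// so functions written against either name accept both with no conversion.
object Option1 {
  type DataFrame = Dataset[Row]
  def describe(ds: Dataset[Row]): Int = ds.data.length
  val df: DataFrame = new Dataset[Row](Seq(new Row(Seq(1, "a"))))
  val n: Int = describe(df) // compiles: the two types are identical
}

// Option 2: DataFrame is a subclass. A plain Dataset[Row] is NOT a
// DataFrame, so a function that demands DataFrame rejects it.
object Option2 {
  class DataFrame(rows: Seq[Row]) extends Dataset[Row](rows)
  def describe(df: DataFrame): Int = df.data.length
  val n: Int = describe(new DataFrame(Nil))
  // describe(new Dataset[Row](Nil)) // type mismatch: would not compile
}
```

The commented-out call in Option2 is exactly the confusion noted in the cons: a Dataset[Row] produced by one API cannot be handed to a function expecting a DataFrame without a conversion step.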
Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"
Thank you, your version of the mvn invocation (as opposed to my bare "mvn eclipse:eclipse") worked perfectly.

On Thu, Feb 25, 2016 at 3:22 PM, Yin Yang wrote:
> Here is the command I used:
>
> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.0 package -DskipTests eclipse:eclipse
Re: [build system] additional jenkins downtime next thursday
alright, the update is done and worker-08 rebooted. we're back up and building already!

On Thu, Feb 25, 2016 at 8:15 AM, shane knapp wrote:
> this is happening now.
>
> On Wed, Feb 24, 2016 at 6:08 PM, shane knapp wrote:
>> the security update has been released, and it's a doozy!
>>
>> https://wiki.jenkins-ci.org/display/SECURITY/Security+Advisory+2016-02-24
>>
>> i will be putting jenkins into quiet mode ~7am PST tomorrow morning
>> for the upgrade, and expect to be back up and building by 9am PST at
>> the latest.
>>
>> amp-jenkins-worker-08 will also be getting a reboot to test out a fix for:
>> https://github.com/apache/spark/pull/9893
>>
>> shane
>>
>> On Wed, Feb 17, 2016 at 10:47 AM, shane knapp wrote:
>>> the security release has been delayed until next wednesday morning,
>>> and i'll be doing the upgrade first thing thursday morning.
>>>
>>> i'll update everyone when i get more information.
>>>
>>> thanks!
>>>
>>> shane
>>>
>>> On Thu, Feb 11, 2016 at 10:19 AM, shane knapp wrote:
>>>> there's a big security patch coming out next week, and i'd like to
>>>> upgrade our jenkins installation so that we're covered. it'll be
>>>> around 8am, again, and i'll send out more details about the upgrade
>>>> when i get them.
>>>>
>>>> thanks!
>>>>
>>>> shane
Re: [build system] additional jenkins downtime next thursday
this is happening now.

On Wed, Feb 24, 2016 at 6:08 PM, shane knapp wrote:
> the security update has been released, and it's a doozy!
>
> https://wiki.jenkins-ci.org/display/SECURITY/Security+Advisory+2016-02-24
>
> i will be putting jenkins into quiet mode ~7am PST tomorrow morning
> for the upgrade, and expect to be back up and building by 9am PST at
> the latest.
>
> amp-jenkins-worker-08 will also be getting a reboot to test out a fix for:
> https://github.com/apache/spark/pull/9893
>
> shane
Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"
In yarn/.classpath , I see:

Here is the command I used:

build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.0 package -DskipTests eclipse:eclipse

FYI

On Thu, Feb 25, 2016 at 6:13 AM, Łukasz Gieroń wrote:
> I've just checked, and "mvn eclipse:eclipse" generates incorrect projects
> as well.
Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"
well, I am using IDEA to import the code base.

At 2016-02-25 22:13:11, "Łukasz Gieroń" wrote:
> I've just checked, and "mvn eclipse:eclipse" generates incorrect projects
> as well.
Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"
I've just checked, and "mvn eclipse:eclipse" generates incorrect projects as well.

On Thu, Feb 25, 2016 at 3:04 PM, Allen Zhang wrote:
> why not use maven
Re:Eclipse: Wrong project dependencies in generated by "sbt eclipse"
dev/change-scala-version 2.10 may help you?

At 2016-02-25 21:55:49, "lgieron" wrote:
>The Spark projects generated by the sbt eclipse plugin have incorrect dependent
>projects (as visible on the Properties -> Java Build Path -> Projects tab). All
>dependent projects are missing the "_2.11" suffix (for example, it's
>"spark-core" instead of the correct "spark-core_2.11"). This of course causes
>the build to fail.
Re:Eclipse: Wrong project dependencies in generated by "sbt eclipse"
why not use maven

At 2016-02-25 21:55:49, "lgieron" wrote:
>The Spark projects generated by the sbt eclipse plugin have incorrect dependent
>projects (as visible on the Properties -> Java Build Path -> Projects tab). All
>dependent projects are missing the "_2.11" suffix (for example, it's
>"spark-core" instead of the correct "spark-core_2.11"). This of course causes
>the build to fail.
Eclipse: Wrong project dependencies in generated by "sbt eclipse"
The Spark projects generated by the sbt eclipse plugin have incorrect dependent projects (as visible on the Properties -> Java Build Path -> Projects tab). All dependent projects are missing the "_2.11" suffix (for example, it's "spark-core" instead of the correct "spark-core_2.11"). This of course causes the build to fail.

I am using sbteclipse-plugin version 4.0.0.

Has anyone encountered this problem and found a fix?

Thanks,
Lukasz

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Eclipse-Wrong-project-dependencies-in-generated-by-sbt-eclipse-tp16436.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Bug in DiskBlockManager subDirs logic?
Hi,

I am debugging a situation where SortShuffleWriter sometimes fails to create a file, with the following stack trace:

16/02/23 11:48:46 ERROR Executor: Exception in task 13.0 in stage 47827.0 (TID 1367089)
java.io.FileNotFoundException: /tmp/spark-9dd8dca9-6803-4c6c-bb6a-0e9c0111837c/executor-129dfdb8-9422-4668-989e-e789703526ad/blockmgr-dda6e340-7859-468f-b493-04e4162d341a/00/temp_shuffle_69fe1673-9ff2-462b-92b8-683d04669aad (No such file or directory)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I checked the Linux file system (ext4) and saw that the /00/ subdir is missing. I went through the heap dump of the CoarseGrainedExecutorBackend JVM process and found that DiskBlockManager's subDirs list had more non-null 2-hex subdirs than were present on the file system! As a test I created all 64 2-hex subdirs by hand, and then the problem went away.

So has anybody else seen this problem? Looking at the relevant logic in DiskBlockManager, it hasn't changed much since the fix to https://issues.apache.org/jira/browse/SPARK-6468

My configuration: spark-1.5.1, hadoop-2.6.0, standalone, oracle jdk8u60

Thanks,
Zee
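One defensive pattern for this failure mode is to recreate the parent directory just before opening the writer, so a subdir removed out from under the block manager (e.g. by a /tmp cleaner) does not kill the task. This is a standalone sketch using plain java.io, not DiskBlockManager's actual code; openWithRetry is a hypothetical helper:

```scala
import java.io.{File, FileOutputStream, IOException}

// Hypothetical defensive open: if the 2-hex parent subdir has vanished,
// recreate it before opening the file instead of failing with
// FileNotFoundException.
def openWithRetry(path: File): FileOutputStream = {
  val parent = path.getParentFile
  if (parent != null && !parent.isDirectory && !parent.mkdirs())
    throw new IOException(s"Failed to create directory $parent")
  new FileOutputStream(path)
}

// Simulate the reported state: the "00" subdir does not exist on disk yet.
val subDir = new File(System.getProperty("java.io.tmpdir"), "blockmgr-demo/00")
val out = openWithRetry(new File(subDir, "temp_shuffle_demo"))
out.write(42)
out.close()
println(subDir.isDirectory) // the missing subdir was recreated
```

This only masks the symptom, of course; it doesn't explain why subDirs held entries for directories that no longer exist on disk.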