SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-06 Thread linxi zeng
Hi, all:
   As recorded in https://issues.apache.org/jira/browse/SPARK-16408, when
using Spark SQL to execute SQL such as:
   add file hdfs://xxx/user/test;
   if the HDFS path (hdfs://xxx/user/test) is a directory, we get an
exception like:

org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a
directory and recursive is not turned on.
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
   at
org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
   at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
   at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
   at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)


   I think we should add a parameter (spark.input.dir.recursive) to control
the value of recursive, and make it take effect with a small code change like
the following:

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
index 6b16d59..3be8553 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
@@ -113,8 +113,9 @@ case class AddFile(path: String) extends RunnableCommand {
 
   override def run(sqlContext: SQLContext): Seq[Row] = {
     val hiveContext = sqlContext.asInstanceOf[HiveContext]
+    val recursive = sqlContext.sparkContext.getConf.getBoolean("spark.input.dir.recursive", false)
     hiveContext.runSqlHive(s"ADD FILE $path")
-    hiveContext.sparkContext.addFile(path)
+    hiveContext.sparkContext.addFile(path, recursive)
     Seq.empty[Row]
   }
 }
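
If such a flag were added, usage might look like the sketch below. Note that
"spark.input.dir.recursive" is only the name proposed in this mail (it is not
an existing Spark configuration), while the two-argument
SparkContext.addFile(path, recursive) already exists:

import org.apache.spark.{SparkConf, SparkContext}

// "spark.input.dir.recursive" is the flag proposed above, not a released
// Spark configuration.
val conf = new SparkConf()
  .setAppName("add-dir-example")
  .set("spark.input.dir.recursive", "true")
val sc = new SparkContext(conf)

// The same effect is already available programmatically, since addFile
// has a recursive variant:
sc.addFile("hdfs://xxx/user/test", true)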


Re: Latest spark release in the 1.4 branch

2016-07-06 Thread Niranda Perera
Thanks Reynold

On Thu, Jul 7, 2016 at 11:40 AM, Reynold Xin  wrote:

> Yes definitely.
>
>
> On Wed, Jul 6, 2016 at 11:08 PM, Niranda Perera 
> wrote:
>
>> Thanks Reynold for the prompt response. Do you think we could use a
>> 1.4-branch latest build in a production environment?
>>
>>
>>
>> On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin  wrote:
>>
>>> I think last time I tried I had some trouble releasing it because the
>>> release scripts no longer work with branch-1.4. You can build from the
>>> branch yourself, but it might be better to upgrade to the later versions.
>>>
>>> On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera <
>>> niranda.per...@gmail.com> wrote:
>>>
 Hi guys,

 May I know if you have halted development in the Spark 1.4 branch? I
 see that there is a release tag for 1.4.2 but it was never released.

 Can we expect a 1.4.x bug fixing release anytime soon?

 Best
 --
 Niranda
 @n1r44 
 +94-71-554-8430
 https://pythagoreanscript.wordpress.com/

>>>
>>>
>>
>>
>> --
>> Niranda
>> @n1r44 
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>


-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/


Re: Latest spark release in the 1.4 branch

2016-07-06 Thread Reynold Xin
Yes definitely.


On Wed, Jul 6, 2016 at 11:08 PM, Niranda Perera 
wrote:

> Thanks Reynold for the prompt response. Do you think we could use a
> 1.4-branch latest build in a production environment?
>
>
>
> On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin  wrote:
>
>> I think last time I tried I had some trouble releasing it because the
>> release scripts no longer work with branch-1.4. You can build from the
>> branch yourself, but it might be better to upgrade to the later versions.
>>
>> On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera > > wrote:
>>
>>> Hi guys,
>>>
>>> May I know if you have halted development in the Spark 1.4 branch? I see
>>> that there is a release tag for 1.4.2 but it was never released.
>>>
>>> Can we expect a 1.4.x bug fixing release anytime soon?
>>>
>>> Best
>>> --
>>> Niranda
>>> @n1r44 
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>
>
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


Re: Latest spark release in the 1.4 branch

2016-07-06 Thread Niranda Perera
Thanks Reynold for the prompt response. Do you think we could use a
1.4-branch latest build in a production environment?



On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin  wrote:

> I think last time I tried I had some trouble releasing it because the
> release scripts no longer work with branch-1.4. You can build from the
> branch yourself, but it might be better to upgrade to the later versions.
>
> On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera 
> wrote:
>
>> Hi guys,
>>
>> May I know if you have halted development in the Spark 1.4 branch? I see
>> that there is a release tag for 1.4.2 but it was never released.
>>
>> Can we expect a 1.4.x bug fixing release anytime soon?
>>
>> Best
>> --
>> Niranda
>> @n1r44 
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>


-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/


Re: Latest spark release in the 1.4 branch

2016-07-06 Thread Reynold Xin
I think last time I tried I had some trouble releasing it because the
release scripts no longer work with branch-1.4. You can build from the
branch yourself, but it might be better to upgrade to the later versions.

On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera 
wrote:

> Hi guys,
>
> May I know if you have halted development in the Spark 1.4 branch? I see
> that there is a release tag for 1.4.2 but it was never released.
>
> Can we expect a 1.4.x bug fixing release anytime soon?
>
> Best
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


Latest spark release in the 1.4 branch

2016-07-06 Thread Niranda Perera
Hi guys,

May I know if you have halted development in the Spark 1.4 branch? I see
that there is a release tag for 1.4.2 but it was never released.

Can we expect a 1.4.x bug fixing release anytime soon?

Best
-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/


Re: Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior ?

2016-07-06 Thread nirandap
Hi Yash,

Yes, AFAIK, that is the expected behavior of the Overwrite mode.

I think you can use the following approaches if you want to perform a job
on each partition (a quick sketch follows the links):
[1] foreachPartition on a DataFrame:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1444
[2] runJob on the SparkContext:
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/SparkContext.scala#L1818
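
A minimal sketch of both options, assuming an existing DataFrame `df`, the
active SparkContext `sc`, and the Spark 1.6 APIs linked above (the function
bodies are just placeholders):

import org.apache.spark.sql.Row

// [1] Run a function once per partition of the DataFrame.
df.foreachPartition { rows: Iterator[Row] =>
  rows.foreach(row => println(row))
}

// [2] Run a job on selected partitions only (here: partitions 0 and 1)
// and collect one result per partition.
val counts: Array[Int] =
  sc.runJob(df.rdd, (it: Iterator[Row]) => it.size, Seq(0, 1))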

Best

On Thu, Jul 7, 2016 at 7:40 AM, Yash Sharma [via Apache Spark Developers
List]  wrote:

> Hi All,
> While writing a partitioned data frame as partitioned text files I see
> that Spark deletes all available partitions while writing few new
> partitions.
>
> dataDF.write.partitionBy(“year”, “month”,
>> “date”).mode(SaveMode.Overwrite).text(“s3://data/test2/events/”)
>
>
> Is this an expected behavior ?
>
> I have a past correction job which would overwrite couple of past
> partitions based on new arriving data. I would only want to remove those
> partitions.
>
> Is there a neater way to do that other than:
> - Find the partitions
> - Delete using Hadoop API's
> - Write DF in Append Mode
>
>
> Cheers
> Yash
>
>



-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/





Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior ?

2016-07-06 Thread Yash Sharma
Hi All,
While writing a partitioned data frame as partitioned text files, I see that
Spark deletes all existing partitions while writing a few new partitions.

dataDF.write.partitionBy(“year”, “month”,
> “date”).mode(SaveMode.Overwrite).text(“s3://data/test2/events/”)


Is this the expected behavior?

I have a past-correction job which would overwrite a couple of past
partitions based on newly arriving data. I would only want to remove those
partitions.

Is there a neater way to do that than the following (a rough sketch of these
steps is below):
- Find the partitions
- Delete them using the Hadoop APIs
- Write the DF in Append mode
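
A rough, hedged sketch of those three steps (the paths, partition values, and
the corrected DataFrame `correctedDF` are all hypothetical; `sc` is the active
SparkContext):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val basePath = "s3://data/test2/events"
// Partitions the correction job needs to rewrite (hypothetical values).
val affected = Seq(("2016", "07", "06"))

// 1. Delete only the affected partition directories via the Hadoop API.
val fs = FileSystem.get(new java.net.URI(basePath), sc.hadoopConfiguration)
affected.foreach { case (y, m, d) =>
  fs.delete(new Path(s"$basePath/year=$y/month=$m/date=$d"), true)
}

// 2. Write the corrected data for those partitions in Append mode, so the
//    untouched partitions are left alone.
correctedDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Append)
  .text(basePath)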


Cheers
Yash


Stopping Spark executors

2016-07-06 Thread Mr rty ff
Hi,
I'd like to recreate this bug: https://issues.apache.org/jira/browse/SPARK-13979.
It talks about stopping Spark executors, but it's not clear exactly how I stop
the executors. Thanks
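
For anyone else trying to reproduce this, one way to take executors down from
the driver is SparkContext.killExecutors. Treat the snippet below as a hedged
sketch rather than advice from this thread: killExecutors is a @DeveloperApi
and only works with cluster managers that support executor allocation (e.g.
YARN), and the IDs are the executor IDs shown on the Executors tab of the web
UI. Killing the executor JVM directly on the worker host also works.

// Ask the cluster manager to kill specific executors (IDs from the web UI).
sc.killExecutors(Seq("1", "2"))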
 

[PySPARK] - Py4J binary transfer survey

2016-07-06 Thread Holden Karau
Hi PySpark Devs,

The Py4J developer has a survey up for Py4J users -
https://github.com/bartdag/py4j/issues/237 - and it might be worth our time to
provide some input on how we are using Py4J, and how we would like to use it if
binary transfer were improved. I'm happy to fill it out with my thoughts -
but if other people are interested too maybe we could work on a response
together?

Cheers,

Holden :)

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Ted Yu
Running the following command:
build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr
-Dhadoop.version=2.7.0 package

The build stopped with this test failure:

- SPARK-9757 Persist Parquet relation with decimal column *** FAILED ***


On Wed, Jul 6, 2016 at 6:25 AM, Sean Owen  wrote:

> Yeah we still have some blockers; I agree SPARK-16379 is a blocker
> which came up yesterday. We also have 5 existing blockers, all doc
> related:
>
> SPARK-14808 Spark MLlib, GraphX, SparkR 2.0 QA umbrella
> SPARK-14812 ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final,
> sealed audit
> SPARK-14816 Update MLlib, GraphX, SparkR websites for 2.0
> SPARK-14817 ML, Graph, R 2.0 QA: Programming guide update and migration
> guide
> SPARK-15124 R 2.0 QA: New R APIs and API docs
>
> While we'll almost surely need another RC, this one is well worth
> testing. It's much closer than even the last one.
>
> The sigs/hashes check out, and I successfully built with Ubuntu 16 /
> Java 8 with -Pyarn -Phadoop-2.7 -Phive. Tests pass except for:
>
> DirectKafkaStreamSuite:
> - offset recovery *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 196
> times over 10.028979855 seconds. Last failure message:
> strings.forall({
> ((x$1: Any) => DirectKafkaStreamSuite.collectedData.contains(x$1))
>   }) was false. (DirectKafkaStreamSuite.scala:250)
> - Direct Kafka stream report input information
>
> I know we've seen this before and tried to fix it but it may need another
> look.
>
> On Wed, Jul 6, 2016 at 6:35 AM, Reynold Xin  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and
> passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.0.0
> > [ ] -1 Do not release this package because ...
> >
> >
> > The tag to be voted on is v2.0.0-rc2
> > (4a55b2326c8cf50f772907a8b73fd5e7b3d1aa06).
> >
> > This release candidate resolves ~2500 issues:
> > https://s.apache.org/spark-2.0.0-jira
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1189/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
> >
> >
> > =
> > How can I help test this release?
> > =
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 1.x.
> >
> > ==
> > What justifies a -1 vote for this release?
> > ==
> > Critical bugs impacting major functionalities.
> >
> > Bugs already present in 1.x, missing features, or bugs related to new
> > features will not necessarily block this release. Note that historically
> > Spark documentation has been published on the website separately from the
> > main release so we do not need to block the release due to documentation
> > errors either.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Cody Koeninger
I know some usages of the 0.10 kafka connector will be broken until
https://github.com/apache/spark/pull/14026  is merged, but the 0.10
connector is a new feature, so not blocking.

Sean I'm assuming the DirectKafkaStreamSuite failure you saw was for
0.8?  I'll take another look at it.

On Wed, Jul 6, 2016 at 8:25 AM, Sean Owen  wrote:
> Yeah we still have some blockers; I agree SPARK-16379 is a blocker
> which came up yesterday. We also have 5 existing blockers, all doc
> related:
>
> SPARK-14808 Spark MLlib, GraphX, SparkR 2.0 QA umbrella
> SPARK-14812 ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final,
> sealed audit
> SPARK-14816 Update MLlib, GraphX, SparkR websites for 2.0
> SPARK-14817 ML, Graph, R 2.0 QA: Programming guide update and migration guide
> SPARK-15124 R 2.0 QA: New R APIs and API docs
>
> While we'll almost surely need another RC, this one is well worth
> testing. It's much closer than even the last one.
>
> The sigs/hashes check out, and I successfully built with Ubuntu 16 /
> Java 8 with -Pyarn -Phadoop-2.7 -Phive. Tests pass except for:
>
> DirectKafkaStreamSuite:
> - offset recovery *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 196
> times over 10.028979855 seconds. Last failure message:
> strings.forall({
> ((x$1: Any) => DirectKafkaStreamSuite.collectedData.contains(x$1))
>   }) was false. (DirectKafkaStreamSuite.scala:250)
> - Direct Kafka stream report input information
>
> I know we've seen this before and tried to fix it but it may need another 
> look.
>
> On Wed, Jul 6, 2016 at 6:35 AM, Reynold Xin  wrote:
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.0-rc2
>> (4a55b2326c8cf50f772907a8b73fd5e7b3d1aa06).
>>
>> This release candidate resolves ~2500 issues:
>> https://s.apache.org/spark-2.0.0-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1189/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
>>
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.x.
>>
>> ==
>> What justifies a -1 vote for this release?
>> ==
>> Critical bugs impacting major functionalities.
>>
>> Bugs already present in 1.x, missing features, or bugs related to new
>> features will not necessarily block this release. Note that historically
>> Spark documentation has been published on the website separately from the
>> main release so we do not need to block the release due to documentation
>> errors either.
>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Sean Owen
Yeah we still have some blockers; I agree SPARK-16379 is a blocker
which came up yesterday. We also have 5 existing blockers, all doc
related:

SPARK-14808 Spark MLlib, GraphX, SparkR 2.0 QA umbrella
SPARK-14812 ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final,
sealed audit
SPARK-14816 Update MLlib, GraphX, SparkR websites for 2.0
SPARK-14817 ML, Graph, R 2.0 QA: Programming guide update and migration guide
SPARK-15124 R 2.0 QA: New R APIs and API docs

While we'll almost surely need another RC, this one is well worth
testing. It's much closer than even the last one.

The sigs/hashes check out, and I successfully built with Ubuntu 16 /
Java 8 with -Pyarn -Phadoop-2.7 -Phive. Tests pass except for:

DirectKafkaStreamSuite:
- offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 196
times over 10.028979855 seconds. Last failure message:
strings.forall({
((x$1: Any) => DirectKafkaStreamSuite.collectedData.contains(x$1))
  }) was false. (DirectKafkaStreamSuite.scala:250)
- Direct Kafka stream report input information

I know we've seen this before and tried to fix it but it may need another look.

On Wed, Jul 6, 2016 at 6:35 AM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc2
> (4a55b2326c8cf50f772907a8b73fd5e7b3d1aa06).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1189/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Maciej Bryński
-1
https://issues.apache.org/jira/browse/SPARK-16379
https://issues.apache.org/jira/browse/SPARK-16371

2016-07-06 7:35 GMT+02:00 Reynold Xin :
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc2
> (4a55b2326c8cf50f772907a8b73fd5e7b3d1aa06).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1189/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>



-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Why's ds.foreachPartition(println) not possible?

2016-07-06 Thread Jacek Laskowski
Thanks Cody, Reynold, and Ryan! Learnt a lot and feel "corrected".

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
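
For the archive, the upshot of the discussion below in one snippet, as run
against a Spark 2.0 Dataset in the shell (`spark` is the SparkSession; the
behaviour is as reported in this thread):

val ds = spark.range(10)

// Does not compile: foreachPartition has both a Scala (Iterator[T] => Unit)
// overload and a Java (ForeachPartitionFunction[T]) overload, and the
// compiler cannot pick one for the bare method reference `println`.
// ds.foreachPartition(println)

// Compiles: an explicit lambda gives the compiler a concrete function value
// to match against the Scala overload; each partition's Iterator is printed.
ds.foreachPartition(x => println(x))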


On Wed, Jul 6, 2016 at 2:46 AM, Shixiong(Ryan) Zhu
 wrote:
> I asked this question in Scala user group two years ago:
> https://groups.google.com/forum/#!topic/scala-user/W4f0d8xK1nk
>
> Take a look if you are interested in.
>
> On Tue, Jul 5, 2016 at 1:31 PM, Reynold Xin  wrote:
>>
>> You can file it here: https://issues.scala-lang.org/secure/Dashboard.jspa
>>
>> Perhaps "bug" is not the right word, but "limitation". println accepts a
>> single argument of type Any and returns Unit, and it appears that Scala
>> fails to infer the correct overloaded method in this case.
>>
>>   def println() = Console.println()
>>   def println(x: Any) = Console.println(x)
>>
>>
>>
>> On Tue, Jul 5, 2016 at 1:27 PM, Cody Koeninger  wrote:
>>>
>>> I don't think that's a scala compiler bug.
>>>
>>> println is a valid expression that returns unit.
>>>
>>> Unit is not a single-argument function, and does not match any of the
>>> overloads of foreachPartition
>>>
>>> You may be used to a conversion taking place when println is passed to
>>> method expecting a function, but that's not a safe thing to do
>>> silently for multiple overloads.
>>>
>>> tldr;
>>>
>>> just use
>>>
>>> ds.foreachPartition(x => println(x))
>>>
>>> you don't need any type annotations
>>>
>>>
>>> On Tue, Jul 5, 2016 at 2:53 PM, Jacek Laskowski  wrote:
>>> > Hi Reynold,
>>> >
>>> > Is this already reported and tracked somewhere. I'm quite sure that
>>> > people will be asking about the reasons Spark does this. Where are
>>> > such issues reported usually?
>>> >
>>> > Pozdrawiam,
>>> > Jacek Laskowski
>>> > 
>>> > https://medium.com/@jaceklaskowski/
>>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> > Follow me at https://twitter.com/jaceklaskowski
>>> >
>>> >
>>> > On Tue, Jul 5, 2016 at 6:19 PM, Reynold Xin 
>>> > wrote:
>>> >> This seems like a Scala compiler bug.
>>> >>
>>> >>
>>> >> On Tuesday, July 5, 2016, Jacek Laskowski  wrote:
>>> >>>
>>> >>> Well, there is foreach for Java and another foreach for Scala. That's
>>> >>> what I can understand. But while supporting two language-specific
>>> >>> APIs
>>> >>> -- Scala and Java -- Dataset API lost support for such simple calls
>>> >>> without type annotations so you have to be explicit about the variant
>>> >>> (since I'm using Scala I want to use Scala API right). It appears
>>> >>> that
>>> >>> any single-argument-function operators in Datasets are affected :(
>>> >>>
>>> >>> My question was to know whether there are works to fix it (if
>>> >>> possible
>>> >>> -- I don't know if it is).
>>> >>>
>>> >>> Pozdrawiam,
>>> >>> Jacek Laskowski
>>> >>> 
>>> >>> https://medium.com/@jaceklaskowski/
>>> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> >>> Follow me at https://twitter.com/jaceklaskowski
>>> >>>
>>> >>>
>>> >>> On Tue, Jul 5, 2016 at 4:21 PM, Sean Owen  wrote:
>>> >>> > Right, should have noticed that in your second mail. But foreach
>>> >>> > already does what you want, right? it would be identical here.
>>> >>> >
>>> >>> > How these two methods do conceptually different things on different
>>> >>> > arguments. I don't think I'd expect them to accept the same
>>> >>> > functions.
>>> >>> >
>>> >>> > On Tue, Jul 5, 2016 at 3:18 PM, Jacek Laskowski 
>>> >>> > wrote:
>>> >>> >> ds is Dataset and the problem is that println (or any other
>>> >>> >> one-element function) would not work here (and perhaps other
>>> >>> >> methods
>>> >>> >> with two variants - Java's and Scala's).
>>> >>> >>
>>> >>> >> Pozdrawiam,
>>> >>> >> Jacek Laskowski
>>> >>> >> 
>>> >>> >> https://medium.com/@jaceklaskowski/
>>> >>> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> >>> >> Follow me at https://twitter.com/jaceklaskowski
>>> >>> >>
>>> >>> >>
>>> >>> >> On Tue, Jul 5, 2016 at 3:53 PM, Sean Owen 
>>> >>> >> wrote:
>>> >>> >>> A DStream is a sequence of RDDs, not of elements. I don't think
>>> >>> >>> I'd
>>> >>> >>> expect to express an operation on a DStream as if it were
>>> >>> >>> elements.
>>> >>> >>>
>>> >>> >>> On Tue, Jul 5, 2016 at 2:47 PM, Jacek Laskowski 
>>> >>> >>> wrote:
>>> >>>  Sort of. Your example works, but could you do a mere
>>> >>>  ds.foreachPartition(println)? Why not? What should I even see
>>> >>>  the
>>> >>>  Java
>>> >>>  version?
>>> >>> 
>>> >>>  scala> val ds = spark.range(10)
>>> >>>  ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]
>>> >>> 
>>> >>>  scala> ds.foreachPartition(println)
>>> >>>  :26: error: overloaded method value foreachPartition
>>> >>>  with
>>> >>>  alternatives:
>>> >>>    (func:
>>> >>> 
>>> >>>  org.apache.spark.api.java.function.ForeachPartitionFunction[Long])Unit
>>> >>>  
>>> >

Re: Spark Task failure with File segment length as negative

2016-07-06 Thread Priya Ch
Has anyone resolved this?


Thanks,
Padma CH

On Wed, Jun 22, 2016 at 4:39 PM, Priya Ch 
wrote:

> Hi All,
>
> I am running a Spark application on 1.8 TB of data (which is stored in Hive
> table format). I am reading the data using HiveContext and processing it.
> The cluster has 5 nodes in total, 25 cores per machine and 250 GB per node. I
> am launching the application with 25 executors, 5 cores each and 45 GB
> per executor. I have also specified the property
> spark.yarn.executor.memoryOverhead=2024.
>
> During the execution, tasks are lost and ShuffleMapTasks are re-submitted.
> I am seeing that tasks are failing with the following message -
>
> java.lang.IllegalArgumentException: requirement failed: File segment length cannot be negative (got -27045427)
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.storage.FileSegment.<init>(FileSegment.scala:28)
>   at org.apache.spark.storage.DiskBlockObjectWriter.fileSegment(DiskBlockObjectWriter.scala:220)
>   at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:184)
>   at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:398)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:206)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> I understand that this is because the shuffle block is > 2 GB, so the Int
> length overflows to a negative value and throws the above exception.
>
> Can someone shed light on this? What is the fix for this?
>
> Thanks,
> Padma CH
>
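
A hedged sketch of one common mitigation, assuming the >2 GB shuffle-block
diagnosis quoted above is right: raise the shuffle parallelism so that each
shuffle block stays well under 2 GB. The values below are illustrative only,
not taken from this thread.

// Increase the number of shuffle partitions used by SQL/HiveContext queries.
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

// Or repartition a DataFrame explicitly before the wide operation:
val repartitioned = df.repartition(2000)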


Re: MinMaxScaler With features include category variables

2016-07-06 Thread Yuhao Yang
You may also find VectorSlicer and SQLTransformer useful in your case. Just
out of curiosity, how do you typically handle categorical features, other
than with OneHotEncoder?

Regards,
Yuhao
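
For reference, a minimal sketch of the approach Yanbo describes in the quoted
message below, using the spark.ml feature transformers. The column names
("age", "income", "categoryVec") and the input DataFrame `df` are hypothetical:

import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}

// Assemble only the columns that should be normalized.
val numericAssembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("numericFeatures")

// Normalize that vector column.
val scaler = new MinMaxScaler()
  .setInputCol("numericFeatures")
  .setOutputCol("scaledNumericFeatures")

// Assemble the scaled vector with the columns that must stay untouched
// (e.g. an already one-hot-encoded category vector) into the final features.
val finalAssembler = new VectorAssembler()
  .setInputCols(Array("scaledNumericFeatures", "categoryVec"))
  .setOutputCol("features")

val assembled = numericAssembler.transform(df)
val scaled    = scaler.fit(assembled).transform(assembled)
val features  = finalAssembler.transform(scaled)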

2016-07-01 4:00 GMT-07:00 Yanbo Liang :

> You can combine the columns which need to be normalized into a vector
> with VectorAssembler and run the normalization on it.
> Do another assembling for the columns that should not be normalized. Finally,
> you can assemble the two vectors into one vector as the feature column and
> feed it into model training.
>
> Thanks
> Yanbo
>
> 2016-06-25 21:16 GMT-07:00 段石石 :
>
>> Hi all:
>>
>>
>> I use MinMaxScaler for data normalization, but I found that the
>> API is only for Vector, so we must vectorize the features first. However,
>> the features usually include two parts: one that needs to be normalized and
>> another that should not be normalized, such as categorical features. I want
>> to add an API on DataFrame which normalizes only the columns we want to
>> normalize. Then we can turn the result into a vector and send it to the ML
>> model API for training. I think that would be very useful for developers
>> doing machine learning.
>>
>>
>>
>> Best Regards
>>
>> Thanks
>>
>
>