Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Saisai Shao
Hi Alex, From my understanding the community is shifting effort from RDD-based APIs to Dataset/DataFrame-based ones, so for me it is not so necessary to add a new RDD-based API, as I mentioned before. Also, for the problem of so many partitions, I think there are many other solutions to handle

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Alexander Pivovarov
Saisai, Reynold, Thank you for your replies. I also think that many variations of the textFile() method might be confusing for users. Better to have just one good textFile() implementation. Do you think sc.textFile() should use CombineTextInputFormat instead of TextInputFormat?

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Xiao Li
Changed my vote to +1. Thanks! 2016-05-19 13:28 GMT-07:00 Xiao Li : > Will do. Thanks! > > 2016-05-19 13:26 GMT-07:00 Reynold Xin : > >> Xiao thanks for posting. Please file a bug in JIRA. Again as I said in >> the email this is not meant to be a

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
Kai, You can simply ignore this test failure until it is fixed > On May 20, 2016, at 12:54, Sun Rui wrote: > > Yes. I also met this issue. It is likely related to recent R versions. > Could you help to submit a JIRA issue? I will take a look at it >> On May 20, 2016, at

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Hyukjin Kwon
I happened to test SparkR on Windows 7 (32-bit) and it seems some tests fail. Could this be a reason to downvote? For more details of the tests, please see https://github.com/apache/spark/pull/13165#issuecomment-220515182 2016-05-20 13:44 GMT+09:00 Takuya UESHIN

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
Yes. I also met this issue. It is likely related to recent R versions. Could you help to submit a JIRA issue? I will take a look at it > On May 20, 2016, at 11:13, Kai Jiang wrote: > > I was trying to build SparkR this week. hmm~ But, I encountered problem with > SparkR unit

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Saisai Shao
From my understanding, I think newAPIHadoopFile or hadoopFile is generic enough for you to support any InputFormat you want. IMO it is not so necessary to add a new API for this. On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov wrote: > Spark users might not know
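The generic entry point Saisai refers to can be sketched as follows. This is a minimal, self-contained sketch assuming Spark and Hadoop 2.x are on the classpath; the temp-file input and the 128 MB split cap are illustrative choices, not from the thread:

```scala
import java.io.{File, PrintWriter}
import java.nio.file.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("combine-demo").setMaster("local[*]"))

// Stand-in input: two tiny text files in a temp directory (illustrative).
val dir = Files.createTempDirectory("combine-demo").toFile
Seq("f1.txt" -> "a\nb", "f2.txt" -> "c").foreach { case (name, body) =>
  val pw = new PrintWriter(new File(dir, name)); pw.write(body); pw.close()
}

// Cap split size so many small files get packed into few splits;
// 128 MB is an assumed value, tune it for your workload.
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

// newAPIHadoopFile accepts any new-API InputFormat, including CombineTextInputFormat.
val lines = sc
  .newAPIHadoopFile(dir.getAbsolutePath, classOf[CombineTextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map(_._2.toString) // copy out of the reused Writable immediately

val collected = lines.collect().sorted
sc.stop()
```

This shows why a dedicated combinedTextFile API may be redundant: the three-argument call above is the whole implementation.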

Re: SparkR dataframe error

2016-05-19 Thread Kai Jiang
I was trying to build SparkR this week, but I encountered a problem with SparkR unit testing. It is probably similar to what Gayathri encountered. I tried running the ./R/run-tests.sh script many times, and it seems the tests fail every time. Here are some environments when I was

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Takuya UESHIN
-1 (non-binding) I filed 2 major bugs of Spark SQL: SPARK-15308 : RowEncoder should preserve nested column name. SPARK-15313 : EmbedSerializerInFilter rule should keep exprIds of output of

Re: SBT doesn't pick resource file after clean

2016-05-19 Thread dhruve ashar
Based on the conversation on PR, the intent was not to pollute the source directory and hence we are placing the generated file outside it in the target/extra-resources directory. I agree that the "sbt way" is to add the generated resources under the resourceManaged setting which was essentially

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
You must specify -Psparkr when building from source. > On May 20, 2016, at 08:09, Gayathri Murali > wrote: > > That helped! Thanks. I am building from source code and I am not sure what > caused the issue with SparkR. > > On Thu, May 19, 2016 at 4:17 PM, Xiangrui
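Sun Rui's point can be sketched as a build invocation. Only -Psparkr is from the thread; the other profiles and the Hadoop version are illustrative assumptions that depend on your environment:

```shell
# Sketch of a SparkR-enabled build from source. -Psparkr is the required
# profile Sun Rui mentions; -Phive and the Hadoop profile are common
# additions shown here as examples, not requirements.
./build/mvn -DskipTests -Psparkr -Phive -Phadoop-2.7 clean package
```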

Re: SparkR dataframe error

2016-05-19 Thread Gayathri Murali
That helped! Thanks. I am building from source code and I am not sure what caused the issue with SparkR. On Thu, May 19, 2016 at 4:17 PM, Xiangrui Meng wrote: > We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the > latest branch-2.0, there could be an

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Jeff Zhang
@Xiao, It is tracked in SPARK-15345 On Fri, May 20, 2016 at 4:20 AM, Xiao Li wrote: > -1 > > Unable to use Hive meta-store in pyspark shell. Tried both HiveContext and > SparkSession. Both failed. It always uses

Re: Spark driver and yarn behavior

2016-05-19 Thread Shankar Venkataraman
Thanks Luciano. The case we are seeing is different - the yarn resource manager is shutting down the container in which the executor is running since there does not seem to be a response and it is deeming it dead. It started another container but the driver seems to be oblivious for nearly 2

Re: Spark driver and yarn behavior

2016-05-19 Thread Luciano Resende
On Thu, May 19, 2016 at 3:16 PM, Shankar Venkataraman < shankarvenkataraman...@gmail.com> wrote: > Hi! > > We are running into an interesting behavior with the Spark driver. We > have Spark running under Yarn. The Spark driver seems to be sending work to a > dead executor for 3 hours before it

Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the latest branch-2.0, there could be an issue with your SparkR installation. Did you try `R/install-dev.sh`? On Thu, May 19, 2016 at 11:42 AM Gayathri Murali < gayathri.m.sof...@gmail.com> wrote: > This is on Spark 2.0. I

Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Reynold Xin
The old one is deprecated but should still work. On Thu, May 19, 2016 at 3:51 PM, Arun Allamsetty wrote: > Hi Doug, > > If you look at the API docs here: >

Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Arun Allamsetty
Hi Doug, If you look at the API docs here: http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/api/scala/index.html#org.apache.spark.sql.hive.HiveContext, you'll see it marked Deprecated (since version 2.0.0): use SparkSession.builder.enableHiveSupport instead. So you probably need to

Spark driver and yarn behavior

2016-05-19 Thread Shankar Venkataraman
Hi! We are running into an interesting behavior with the Spark driver. We have Spark running under Yarn. The Spark driver seems to be sending work to a dead executor for 3 hours before it recognizes it. The workload seems to have been processed by other executors just fine and we see no loss in

Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Michael Armbrust
> > 1. “val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)” > doesn’t work because “HiveContext not a member of > org.apache.spark.sql.hive” I checked the documentation, and it looks like > it should still work for spark-2.0.0-preview-bin-hadoop2.7.tgz > HiveContext has been
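The 2.0 replacement Michael describes can be sketched like this. It assumes Spark 2.0 with Hive support compiled in (-Phive); the app name and local master are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession with enableHiveSupport takes over HiveContext's role in 2.0.
// This only works if Hive classes are on the classpath (a -Phive build).
val spark = SparkSession.builder()
  .appName("hive-demo")        // illustrative name
  .master("local[*]")          // illustrative master
  .enableHiveSupport()         // replaces `new HiveContext(sc)`
  .getOrCreate()

// Code still written against the old API can use the bridge:
val sqlContext = spark.sqlContext

val tableCount = spark.sql("SHOW TABLES").count()
spark.stop()
```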

Re: SBT doesn't pick resource file after clean

2016-05-19 Thread Jakob Odersky
To echo my comment on the PR: I think the "sbt way" to add extra, generated resources to the classpath is by adding a new task to the `resourceGenerators` setting. Also, the task should output any files into the directory specified by the `resourceManaged` setting. See

Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Doug Balog
I haven’t had time to really look into this problem, but I want to mention it. I downloaded http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/spark-2.0.0-preview-bin-hadoop2.7.tgz and tried to run it against our Secure Hadoop cluster and access a Hive table. 1. “val

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread vishnu prasad
+1 On 20 May 2016 at 01:19, Herman van Hövell tot Westerflier < hvanhov...@questtec.nl> wrote: > +1 > > > 2016-05-19 18:20 GMT+02:00 Xiangrui Meng : > >> +1 >> >> On Thu, May 19, 2016 at 9:18 AM Joseph Bradley >> wrote: >> >>> +1 >>> >>> On Wed, May

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Xiao Li
Will do. Thanks! 2016-05-19 13:26 GMT-07:00 Reynold Xin : > Xiao thanks for posting. Please file a bug in JIRA. Again as I said in the > email this is not meant to be a functional release and will contain bugs. > > On Thu, May 19, 2016 at 1:20 PM, Xiao Li

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Reynold Xin
Xiao thanks for posting. Please file a bug in JIRA. Again as I said in the email this is not meant to be a functional release and will contain bugs. On Thu, May 19, 2016 at 1:20 PM, Xiao Li wrote: > -1 > > Unable to use Hive meta-store in pyspark shell. Tried both

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Xiao Li
-1 Unable to use the Hive metastore in the pyspark shell. Tried both HiveContext and SparkSession. Both failed. It always uses the in-memory catalog. Anybody else hit the same issue? Method 1: SparkSession >>> from pyspark.sql import SparkSession >>> spark =

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Herman van Hövell tot Westerflier
+1 2016-05-19 18:20 GMT+02:00 Xiangrui Meng : > +1 > > On Thu, May 19, 2016 at 9:18 AM Joseph Bradley > wrote: > >> +1 >> >> On Wed, May 18, 2016 at 10:49 AM, Reynold Xin >> wrote: >> >>> Hi Ovidiu-Cristian , >>> >>> The best

Re: SparkR dataframe error

2016-05-19 Thread Gayathri Murali
This is on Spark 2.0. I see the following in unit-tests.log when I run R/run-tests.sh. This is on a single Mac laptop, on the recently rebased master. R version is 3.3.0. 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor: Exception in task 0.0 in stage 5186.0 (TID 10370)

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Andrew Or
+1, some maintainers are hard to find 2016-05-19 9:03 GMT-07:00 Imran Rashid : > +1 (binding) on removal of maintainers > > I don't have a strong opinion yet on how to have a system for finding the > right reviewers. I agree it would be nice to have something to help you >

Dataset reduceByKey

2016-05-19 Thread Andres Perez
Hi all, We were in the process of porting an RDD program to one which uses Datasets. Most things were easy to transition, but one hole in functionality we found was the ability to reduce a Dataset by key, something akin to PairRDDFunctions.reduceByKey. Our first attempt at adding the
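The gap Andres describes can be approximated with the Dataset APIs that do exist in 2.0. This is a sketch, not the thread's proposed addition; it assumes Spark 2.0 on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ds-reduce").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// groupByKey + reduceGroups gives reduceByKey-like behavior on a Dataset,
// at the cost of carrying the key twice (in the tuple and as the grouping key).
val reduced = ds
  .groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2))
  .map { case (_, pair) => pair } // drop the duplicated grouping key

val result = reduced.collect().sortBy(_._1)
// result: Array(("a", 3), ("b", 3))
spark.stop()
```

The awkwardness of the final `map` is a fair illustration of why a first-class reduceByKey on Datasets was being requested.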

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Alexander Pivovarov
Spark users might not know about CombineTextInputFormat. They probably think that sc.textFile already implements the best way to read text files. I think CombineTextInputFormat can replace regular TextInputFormat in most of the cases. Maybe Spark 2.0 can use CombineTextInputFormat in sc.textFile

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Joseph Bradley
+1 On Wed, May 18, 2016 at 10:49 AM, Reynold Xin wrote: > Hi Ovidiu-Cristian , > > The best source of truth is change the filter with target version to > 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we > get closer to 2.0 release, more will be

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Yin Huai
+1 On Wed, May 18, 2016 at 10:49 AM, Reynold Xin wrote: > Hi Ovidiu-Cristian , > > The best source of truth is change the filter with target version to > 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we > get closer to 2.0 release, more will be

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Imran Rashid
+1 (binding) on removal of maintainers I dont' have a strong opinion yet on how to have a system for finding the right reviewers. I agree it would be nice to have something to help you find reviewers, though I'm a little skeptical of anything automatic. On Thu, May 19, 2016 at 10:34 AM, Matei

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Nicholas Chammas
I’ve also heard that we should try to keep some other instructions for contributors to find the “right” reviewers, so it would be great to see suggestions on that. For my part, I’d personally prefer something “automatic”, such as easily tracking who reviewed each patch and having people look at

right outer joins on Datasets

2016-05-19 Thread Andres Perez
Hi all, I'm getting some odd behavior when using the joinWith functionality for Datasets. Here is a small test case: val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS() val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS() val joined = left.toDF("k", "v").as[(String,
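The test case above is truncated; here is a hedged reconstruction of the kind of right outer joinWith it describes. The join condition and join-type string are assumptions filled in for illustration, not the original code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("joinwith-demo").master("local[*]").getOrCreate()
import spark.implicits._

val left  = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()

// "d" has no match on the left, so a right outer join must invent a
// left-side value; how joinWith represents that missing side is where
// odd behavior can surface.
val joined = left.joinWith(right, left("_1") === right("_1"), "right_outer")

val n = joined.count() // "a" matches twice, "b" once, "d" once: 4 rows
joined.collect().foreach(println)
spark.stop()
```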

Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
Is it on 1.6.x? On Wed, May 18, 2016, 6:57 PM Sun Rui wrote: > I saw it, but I can’t see the complete error message on it. > I mean the part after “error in invokingJava(…)” > > On May 19, 2016, at 08:37, Gayathri Murali > wrote: > > There was

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
Not exactly the same as the one you suggested, but you can chain it with flatMap to get what you want, if each file is not huge. On Thu, May 19, 2016, 8:41 AM Xiangrui Meng wrote: > This was implemented as sc.wholeTextFiles. > > On Thu, May 19, 2016, 2:43 AM Reynold Xin
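Xiangrui's wholeTextFiles-plus-flatMap suggestion can be sketched as follows; the temp-file input is an illustrative stand-in for a directory of many small files:

```scala
import java.io.{File, PrintWriter}
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wtf-demo").setMaster("local[*]"))

// Two tiny files stand in for "many small files" (illustrative).
val dir = Files.createTempDirectory("wtf-demo").toFile
Seq("f1.txt" -> "a\nb", "f2.txt" -> "c").foreach { case (name, body) =>
  val pw = new PrintWriter(new File(dir, name)); pw.write(body); pw.close()
}

// wholeTextFiles yields one (path, content) pair per file; flatMap over the
// content recovers per-line records. Only safe when each individual file
// fits comfortably in memory, per the caveat in the thread.
val lines = sc.wholeTextFiles(dir.getAbsolutePath).flatMap(_._2.split("\n"))
val collected = lines.collect().sorted
sc.stop()
```

Unlike CombineTextInputFormat, this materializes each whole file as one string before splitting, which is Reynold's point about the two not being equivalent.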

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Reynold Xin
It is different, isn't it? wholeTextFiles returns one element per file, whereas the combined input format is similar to coalescing partitions to bin-pack into a certain size. On Thursday, May 19, 2016, Xiangrui Meng wrote: > This was implemented as sc.wholeTextFiles. > > On Thu,

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
This was implemented as sc.wholeTextFiles. On Thu, May 19, 2016, 2:43 AM Reynold Xin wrote: > Users would be able to run this already with the 3 lines of code you > supplied right? In general there are a lot of methods already on > SparkContext and we lean towards the more

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Tom Graves
+1 (binding) Tom On Thursday, May 19, 2016 10:35 AM, Matei Zaharia wrote: Hi folks, Around 1.5 years ago, Spark added a maintainer process for reviewing API and architectural changes

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Mridul Muralidharan
+1 (binding) on removing maintainer process. I agree with your opinion of "automatic" instead of a manual list. Regards Mridul On Thursday, May 19, 2016, Matei Zaharia wrote: > Hi folks, > > Around 1.5 years ago, Spark added a maintainer process for reviewing API >

[DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Matei Zaharia
Hi folks, Around 1.5 years ago, Spark added a maintainer process for reviewing API and architectural changes (https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers) to make sure these are seen by people who spent a lot of time on that component.

Spark Security. Generating SSL keystore for each job

2016-05-19 Thread ScoRp
Hello, I have a question about Spark security. The Spark documentation states that when Spark is running on YARN, a keystore for SSL encryption should be generated for each job and may be distributed using the spark.yarn.dist.files option. But there is no implementation for this in the Spark source

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Reynold Xin
Users would be able to run this already with the 3 lines of code you supplied right? In general there are a lot of methods already on SparkContext and we lean towards the more conservative side in introducing new API variants. Note that this is something we are doing automatically in Spark SQL

Re: Nested/Chained case statements generate codegen over 64k exception

2016-05-19 Thread Jonathan Gray
That makes sense, I will take a look there first. That will at least give a clearer understanding of the problem space to determine when to fall back. On 15 May 2016 3:02 am, "Reynold Xin" wrote: > It might be best to fix this with fallback first, and then figure out how > we