Re: [Spark R]: dapply only works for very small datasets

2017-11-28 Thread Felix Cheung
You can find more discussion in https://issues.apache.org/jira/browse/SPARK-18924 and https://issues.apache.org/jira/browse/SPARK-17634. I suspect the cost is linear, so partitioning the data into smaller chunks, with more executors (one core each) running in parallel, would probably help a bit.
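For illustration, a minimal PySpark sketch of that suggestion (dapply itself is SparkR; the partition count and per-partition function below are made up): split the data into many small partitions so each task stays cheap and, with one core per executor, the partitions run in parallel.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 10_000_000)

    def process(rows):
        # stand-in for the per-partition work dapply would do in SparkR
        for row in rows:
            yield (row.id * 2,)

    # Many small partitions: each task does less work, and with one
    # core per executor the partitions are processed in parallel.
    result = df.repartition(64).rdd.mapPartitions(process).toDF(["doubled"])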

Re: NLTK with Spark Streaming

2017-11-28 Thread Nicholas Hakobian
Depending on your needs, it's fairly easy to write a lightweight Python wrapper around the Databricks spark-corenlp library: https://github.com/databricks/spark-corenlp Nicholas Szandor Hakobian, Ph.D., Staff Data Scientist, Rally Health, nicholas.hakob...@rallyhealth.com
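For illustration, a minimal, untested sketch of such a wrapper, assuming the jar is on the classpath and the library exposes its Scala functions (e.g. ner, sentiment) as UDF vals on com.databricks.spark.corenlp.functions:

    from pyspark import SparkContext
    from pyspark.sql.column import Column, _to_java_column, _to_seq

    def _corenlp(name, *cols):
        """Look up a spark-corenlp Scala UDF by name and apply it to columns."""
        sc = SparkContext._active_spark_context
        scala_udf = getattr(sc._jvm.com.databricks.spark.corenlp.functions, name)()
        jc = scala_udf.apply(_to_seq(sc, list(cols), _to_java_column))
        return Column(jc)

    def ner(col):
        return _corenlp("ner", col)

    def sentiment(col):
        return _corenlp("sentiment", col)

Usage would then look like df.select(ner(df["text"])), keeping the Scala library's processing on the JVM side.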

Re: Writing custom Structured Streaming receiver

2017-11-28 Thread Hien Luu
Cool. Thanks, nezhazheng. I will give it a shot.

Structured Streaming: emitted record count

2017-11-28 Thread aravias
In Structured Streaming, the QueryProgressEvent does not seem to include the count of records emitted to the destination; I see only the number of input rows. I was trying to use count (an additional action after persisting the dataset), but I hit the below exception when calling persist or
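For context, a short PySpark sketch of what the progress API does expose (stream_df and the paths are assumed for illustration):

    query = (stream_df.writeStream
             .format("parquet")
             .option("path", "/tmp/out")
             .option("checkpointLocation", "/tmp/ckpt")
             .start())

    progress = query.lastProgress  # dict form of the last QueryProgressEvent
    if progress is not None:
        print(progress["numInputRows"])  # input rows only; no emitted-row count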

Re: does "Deep Learning Pipelines" scale out linearly?

2017-11-28 Thread Tim Hunter
Hello Andy, regarding your question, this will depend a lot on the specific task: - for tasks that are "easy" to distribute, such as inference (scoring), hyper-parameter tuning, or cross-validation, the cluster is used to full advantage and performance should improve more or less
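For illustration, a small PySpark sketch of one such "easy to distribute" task, cross-validation, where each fold/parameter fit is an independent job (train_df is an assumed DataFrame of labeled data):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression()
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .build())

    # Each fold/parameter combination is fit independently, so adding
    # executors lets more of these fits run at the same time.
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    model = cv.fit(train_df)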

Re: Spark Data Frame. PreSorted partitions

2017-11-28 Thread Michael Artz
I'm not sure of a way other than reading from a Hive table that is already sorted. This sounds cool, though; I would be interested to know this as well.
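For illustration, a sketch of that route in PySpark: write the table bucketed and sorted, then read it back (df, the bucket count, and the table name are illustrative; whether the optimizer actually exploits the per-bucket sort depends on the query):

    # Persist the data bucketed and sorted by the column of interest.
    (df.write
       .bucketBy(8, "some_column")
       .sortBy("some_column")
       .mode("overwrite")
       .saveAsTable("table1_sorted"))

    sorted_df = spark.table("table1_sorted")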

Spark Data Frame. PreSorted partitions

2017-11-28 Thread Николай Ижиков
Hello, guys! I am working on an implementation of a custom DataSource for the Spark Data Frame API and have a question: given a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside a partition in my data source. Is there a built-in option to tell Spark that the data from each partition

Re: [Spark R]: dapply only works for very small datasets

2017-11-28 Thread Kunft, Andreas
Thanks for the fast reply. I tried it locally with 1-8 slots on an 8-core machine with 25 GB memory, as well as on 4 nodes with the same specifications. When I shrink the data to around 100 MB, it runs in about 1 hour with 1 core and about 6 minutes with 8 cores. I'm aware that the SerDe takes
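For reference, the one-core-per-executor layout Felix suggested could be set up like this (a sketch; the memory and instance values are illustrative, not from the thread):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.instances", "8")  # more executors...
             .config("spark.executor.cores", "1")      # ...one core each
             .config("spark.executor.memory", "3g")
             .getOrCreate())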