Re: [Spark R]: dapply only works for very small datasets

2017-11-28 Thread Felix Cheung
You can find more discussion in https://issues.apache.org/jira/browse/SPARK-18924 and https://issues.apache.org/jira/browse/SPARK-17634. I suspect the cost is linear, so partitioning the data into smaller chunks, with more executors (one core each) running in parallel, would probably help a bit.
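For illustration, a minimal PySpark sketch of that suggestion (dapply itself is SparkR; the partition count and per-partition function below are made up): split the data into many small partitions so each task stays cheap and, with one core per executor, the partitions run in parallel.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 10_000_000)

    def process(rows):
        # stand-in for the per-partition work dapply would do in SparkR
        for row in rows:
            yield (row.id * 2,)

    # Many small partitions: each task does less work, and with one
    # core per executor the partitions are processed in parallel.
    result = df.repartition(64).rdd.mapPartitions(process).toDF(["doubled"])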

Re: NLTK with Spark Streaming

2017-11-28 Thread Nicholas Hakobian
Depending on your needs, it's fairly easy to write a lightweight Python wrapper around the Databricks spark-corenlp library: https://github.com/databricks/spark-corenlp Nicholas Szandor Hakobian, Ph.D., Staff Data Scientist, Rally Health, nicholas.hakob...@rallyhealth.com
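For illustration, a minimal, untested sketch of such a wrapper, assuming the jar is on the classpath and the library exposes its Scala functions (e.g. ner, sentiment) as UDF vals on com.databricks.spark.corenlp.functions:

    from pyspark import SparkContext
    from pyspark.sql.column import Column, _to_java_column, _to_seq

    def _corenlp(name, *cols):
        """Look up a spark-corenlp Scala UDF by name and apply it to columns."""
        sc = SparkContext._active_spark_context
        scala_udf = getattr(sc._jvm.com.databricks.spark.corenlp.functions, name)()
        jc = scala_udf.apply(_to_seq(sc, list(cols), _to_java_column))
        return Column(jc)

    def ner(col):
        return _corenlp("ner", col)

    def sentiment(col):
        return _corenlp("sentiment", col)

Usage would then look like df.select(ner(df["text"])), keeping the Scala library's processing on the JVM side.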

Re: Writing custom Structured Streaming receiver

2017-11-28 Thread Hien Luu
Cool. Thanks, nezhazheng. I will give it a shot.

Structured Streaming: emitted record count

2017-11-28 Thread aravias
In Structured Streaming, the QueryProgressEvent does not seem to include the count of records emitted to the destination; I see only the number of input rows. I was trying to use count (an additional action after persisting the dataset), but I hit the below exception when calling persist or
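For context, a short PySpark sketch of what the progress API does expose (stream_df and the paths are assumed for illustration):

    query = (stream_df.writeStream
             .format("parquet")
             .option("path", "/tmp/out")
             .option("checkpointLocation", "/tmp/ckpt")
             .start())

    progress = query.lastProgress  # dict form of the last QueryProgressEvent
    if progress is not None:
        print(progress["numInputRows"])  # input rows only; no emitted-row count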

Re: does "Deep Learning Pipelines" scale out linearly?

2017-11-28 Thread Tim Hunter
Hello Andy, regarding your question, this will depend a lot on the specific task: - for tasks that are "easy" to distribute, such as inference (scoring), hyper-parameter tuning, or cross-validation, the cluster is used to full advantage and performance should improve more or less
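For illustration, a small PySpark sketch of one such "easy to distribute" task, cross-validation, where each fold/parameter fit is an independent job (train_df is an assumed DataFrame of labeled data):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression()
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .build())

    # Each fold/parameter combination is fit independently, so adding
    # executors lets more of these fits run at the same time.
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    model = cv.fit(train_df)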

Re: Spark Data Frame. PreSorted partitions

2017-11-28 Thread Michael Artz
I'm not sure of a way other than reading from a Hive table that is already sorted. This sounds cool, though; I would be interested to know this as well.
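For illustration, a sketch of that route in PySpark: write the table bucketed and sorted, then read it back (df, the bucket count, and the table name are illustrative; whether the optimizer actually exploits the per-bucket sort depends on the query):

    # Persist the data bucketed and sorted by the column of interest.
    (df.write
       .bucketBy(8, "some_column")
       .sortBy("some_column")
       .mode("overwrite")
       .saveAsTable("table1_sorted"))

    sorted_df = spark.table("table1_sorted")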

Spark Data Frame. PreSorted partitions

2017-11-28 Thread Николай Ижиков
Hello, guys! I am working on an implementation of a custom DataSource for the Spark Data Frame API and have a question: given a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside a partition in my data source. Is there a built-in option to tell Spark that the data from each partition

Re: [Spark R]: dapply only works for very small datasets

2017-11-28 Thread Kunft, Andreas
Thanks for the fast reply. I tried it locally with 1-8 slots on an 8-core machine with 25 GB memory, as well as on 4 nodes with the same specifications. When I shrink the data to around 100 MB, it runs in about 1 hour with 1 core and about 6 minutes with 8 cores. I'm aware that the SerDe takes
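For reference, the one-core-per-executor layout Felix suggested could be set up like this (a sketch; the memory and instance values are illustrative, not from the thread):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.instances", "8")  # more executors...
             .config("spark.executor.cores", "1")      # ...one core each
             .config("spark.executor.memory", "3g")
             .getOrCreate())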