You can find more discussion in
https://issues.apache.org/jira/browse/SPARK-18924 and
https://issues.apache.org/jira/browse/SPARK-17634
I suspect the cost is linear - so partitioning the data into smaller chunks
with more executors (one core each) running in parallel would probably help a
bit.
Depending on your needs, it's fairly easy to write a lightweight Python
wrapper around the Databricks spark-corenlp library:
https://github.com/databricks/spark-corenlp
Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com
On Sun, Nov 26, 2017 at
Cool. Thanks nezhazheng. I will give it a shot.
In structured streaming, the QueryProgressEvent does not seem to include
the final emitted record count for the destination; I see only the number of
input rows. I was trying to use a count (an additional action after
persisting the dataset), but I hit the exception below when calling persist
or
Hello Andy,
Regarding your question, this will depend a lot on the specific task:
- tasks that are "easy" to distribute, such as inference
(scoring), hyper-parameter tuning or cross-validation, will take
full advantage of the cluster, and the performance should
improve more or less
I'm not sure of a way other than retrieving from a Hive table that is already
sorted. This sounds cool, though; I would be interested to know this as well.
On Nov 28, 2017 10:40 AM, "Николай Ижиков" wrote:
> Hello, guys!
>
> I work on implementation of custom DataSource for Spark
Hello, guys!
I am working on an implementation of a custom DataSource for the Spark
DataFrame API and have a question:
if I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the
data inside a partition in my data source.
Do I have a built-in option to tell Spark that the data from each partition
Thanks for the fast reply.
I tried it locally, with 1-8 slots on an 8-core machine with 25 GB of memory,
as well as on 4 nodes with the same specifications.
When I shrink the data to around 100MB,
it runs in about 1 hour for 1 core and about 6 min with 8 cores.
I'm aware that the SerDe takes