Re: Why is shuffle write size so large when joining Dataset with nested structure?

2016-11-25 Thread Takeshi Yamamuro
Hi, I think this is just the overhead of representing nested elements as internal rows at runtime (e.g., it consumes null bits for each nested element). Moreover, in the Parquet format, nested data are stored in a columnar layout and highly compressed, so the file becomes quite compact. But I'm not sure about better approaches
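A minimal sketch of the situation being described (the case class, paths, and join are hypothetical, not from the thread): the same nested Dataset is compact on disk as Parquet, but each record is decoded into an internal row for the join, which inflates the shuffle write.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical nested structure, just to illustrate the point above.
case class Item(name: String, tags: Map[String, String])
case class Record(id: Long, items: Seq[Item])

val spark = SparkSession.builder().appName("nested-size").getOrCreate()
import spark.implicits._

// On disk: columnar, encoded, compressed Parquet -> compact.
val ds = spark.read.parquet("/path/to/records").as[Record]
val other = spark.read.parquet("/path/to/other").as[Record]

// During the join, each record is decoded into an InternalRow that carries
// null bits, offsets, etc. for every nested field, so the shuffle write
// ends up much larger than the Parquet files.
val joined = ds.joinWith(other, ds("id") === other("id"))
```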

Why is shuffle write size so large when joining Dataset with nested structure?

2016-11-25 Thread taozhuo
The Dataset is defined as a case class with many fields with nested structure (Map, List of another case class, etc.). The size of the Dataset is only 1T when saved to disk as a Parquet file. But when joining it, the shuffle write size becomes as large as 12T. Is there a way to cut it down without
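One common way to shrink the shuffle in this kind of join (a hedged sketch with made-up names, not an answer from the thread): project away the heavy nested columns before the shuffle, or broadcast the smaller side so the wide Dataset is not shuffled at all.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical schemas standing in for the wide nested Dataset and a second table.
case class Wide(id: Long, payload: Map[String, String], history: Seq[String])
case class Lookup(id: Long, label: String)

val spark = SparkSession.builder().appName("slim-join").getOrCreate()
import spark.implicits._

val wide = spark.read.parquet("/path/to/wide").as[Wide]
val lookup = spark.read.parquet("/path/to/lookup").as[Lookup]

// Shuffle only the columns the downstream logic actually needs.
val slim = wide.select("id", "payload")
val joined = slim.join(lookup.toDF(), Seq("id"))

// If the other side is small enough, a broadcast join avoids shuffling the wide side entirely.
val joinedNoShuffle = wide.join(broadcast(lookup.toDF()), Seq("id"))
```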

Re: Third party library

2016-11-25 Thread Reynold Xin
bcc dev@ and add user@. This is more of a user@ list question than a dev@ list question. You can do something like this: object MySimpleApp { def loadResources(): Unit = // define some idempotent way to load resources, e.g. with a flag or lazy val def main() = { ...
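A self-contained sketch of the pattern the truncated snippet suggests (the lazy-val body and names are assumptions, not the exact code from the message):

```scala
object MySimpleApp {
  // Idempotent resource loading: the lazy val body runs at most once per JVM,
  // so repeated calls (e.g., from multiple tasks on an executor) are safe.
  private lazy val resources: Map[String, String] = {
    // ... load files, models, connections, etc. here ...
    Map("example" -> "loaded")
  }

  def loadResources(): Unit = {
    resources  // forces initialization exactly once
    ()
  }

  def main(args: Array[String]): Unit = {
    loadResources()
    // ... build the SparkSession / SparkContext and run the job ...
  }
}
```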

Re: covert local tsv file to orc file on distributed cloud storage(openstack).

2016-11-25 Thread Steve Loughran
Have the first path be something like .csv("file://home/user/dataset/data.csv"). If you are working with files that big: don't use the inferSchema option, as that will trigger two scans through the data; try with a smaller file first, say 1MB or so. Trying to use Spark *or any other tool* to
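A rough sketch of the approach being described (the schema, paths, and the swift:// output URL are placeholders/assumptions), matching the thread's goal of converting a local TSV to ORC on OpenStack storage:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("tsv-to-orc").getOrCreate()

// Declare the schema up front instead of inferSchema, which scans the data twice.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)
))

val df = spark.read
  .option("sep", "\t")                          // TSV input
  .schema(schema)
  .csv("file:///home/user/dataset/data.tsv")    // try a ~1MB sample first

df.write.orc("swift://container.provider/output/orc")  // placeholder OpenStack Swift URL
```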

Update Cassandra null value

2016-11-25 Thread Tomas Carini
case class RowObj(page: Int, country: String, id: String, step: Int) sc.cassandraTable[RowObj]("keyspace","table").map(row => row.copy(step = 8)).saveToCassandra("keyspace","table",SomeColumns("page","country","id","step")) Hi guys. I'm pretty new to Spark and even more so to Scala. I'm trying to do a
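Since the subject concerns null values, here is a hedged variant of the snippet above using Option[Int] (my assumption, not from the message), with the imports the DataStax connector needs; exact null/unset semantics depend on the spark-cassandra-connector version, and an existing SparkContext `sc` is assumed:

```scala
import com.datastax.spark.connector._   // spark-cassandra-connector

// Option[Int] lets a row carry "no value"; with the connector, writing None
// typically results in a null (or unset) column value for that field.
case class RowObj(page: Int, country: String, id: String, step: Option[Int])

sc.cassandraTable[RowObj]("keyspace", "table")
  .map(row => row.copy(step = Some(8)))
  .saveToCassandra("keyspace", "table",
    SomeColumns("page", "country", "id", "step"))
```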

Tracking opened files by Spark application

2016-11-25 Thread David Lauzon
Hi there! Does Apache Spark offer a callback or some kind of plugin mechanism that would allow this? For a bit more context, I am looking to create a Record-Replay environment using Apache Spark as the core processing engine. The goals are: 1. Trace the origins of every file generated by Spark:
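The closest built-in hook I can point to is a SparkListener watching task I/O metrics; a sketch (assuming Spark 2.x, and noting this reports per-task byte/record counts rather than individual file paths, so it is only a partial answer to provenance tracking):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs how much each finished task read and wrote.
class IoTrackingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage=${taskEnd.stageId} read=${m.inputMetrics.bytesRead}B " +
        s"written=${m.outputMetrics.bytesWritten}B")
    }
  }
}

// Registration (assuming an existing SparkContext `sc`):
// sc.addSparkListener(new IoTrackingListener)
// or via configuration: spark.extraListeners=<fully.qualified.ClassName>
```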

Multilabel classification with Spark MLlib

2016-11-25 Thread Md. Rezaul Karim
Hello All, Is there anyone who has developed multilabel classification applications with Spark? I found an example class in the Spark distribution (i.e., *JavaMultiLabelClassificationMetricsExample.java*), which is not a classifier but an evaluator for multilabel classification. Moreover, the
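For reference, a small Scala sketch of the evaluator that example demonstrates (MultilabelMetrics from spark.mllib); the predictions RDD is made up, and an existing SparkContext `sc` is assumed:

```scala
import org.apache.spark.mllib.evaluation.MultilabelMetrics
import org.apache.spark.rdd.RDD

// Each element is (predicted labels, true labels) for one instance.
val predictionsAndLabels: RDD[(Array[Double], Array[Double])] = sc.parallelize(Seq(
  (Array(0.0, 1.0), Array(0.0, 2.0)),
  (Array(0.0, 2.0), Array(0.0, 1.0)),
  (Array.empty[Double], Array(0.0))
))

val metrics = new MultilabelMetrics(predictionsAndLabels)
println(s"Hamming loss    = ${metrics.hammingLoss}")
println(s"Subset accuracy = ${metrics.subsetAccuracy}")
println(s"Micro F1        = ${metrics.microF1Measure}")
```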

Re: OS killing Executor due to high (possibly off heap) memory usage

2016-11-25 Thread Aniket Bhatnagar
Thanks Rohit, Roddick and Shreya. I tried changing spark.yarn.executor.memoryOverhead to 10 GB and lowering executor memory to 30 GB, and neither of these worked. I finally had to reduce the number of cores per executor to 18 (from 36) in addition to setting higher
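The combination being described, as a hedged SparkConf sketch (values are the ones mentioned above; spark.yarn.executor.memoryOverhead is specified in MB):

```scala
import org.apache.spark.SparkConf

// Larger off-heap overhead, smaller heap, fewer cores per executor
// (fewer concurrent tasks -> less off-heap memory pressure).
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "10240") // ~10 GB, value is in MB
  .set("spark.executor.memory", "30g")
  .set("spark.executor.cores", "18")
```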

RDD persist() not honoured

2016-11-25 Thread Ioannis.Deligiannis
Hi, I have run into a weird caching problem (using Spark 1.3.1 + Java 1.8.0) that I can only explain as a bug. In summary, I source the RDD from an Avro file, apply a mapToPair function, then count & cache. However, the RDD is not cached, nor does it appear in the Spark UI Storage tab. (This is not cached at
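Not necessarily the poster's bug, but a Scala sketch of the usual pattern and the common pitfall: cache() is lazy, only fills on the first action, and only helps if the same cached reference is reused (the input path and transformation are placeholders):

```scala
// Cache the transformed RDD, trigger an action, then reuse the SAME reference.
val pairs = sc.textFile("/path/to/input")            // placeholder source (the poster reads Avro)
  .map(line => (line.take(1), line.length))
  .cache()                                           // marks `pairs` for caching (lazy)

pairs.count()            // first action materializes and stores the partitions
pairs.take(5)            // served from the cache; should now appear under Storage in the UI

// Common pitfall: rebuilding the lineage from the source again instead of reusing
// `pairs` -- the rebuilt RDD is a different object and is NOT cached.
```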

[StackOverflow] Size exceeds Integer.MAX_VALUE When Joining 2 Large DFs

2016-11-25 Thread Gerard Maas
This question seems to deserve an escalation from Stack Overflow: http://stackoverflow.com/questions/40803969/spark-size-exceeds-integer-max-value-when-joining-2-large-dfs Looks like an important limitation. -kr, Gerard. Meta PS: What do you think would be the best way to escalate from SO?
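For context, the error is the well-known ~2 GB (Integer.MAX_VALUE bytes) limit on a single block/partition; a common workaround (a sketch with placeholder names, not taken from the linked question) is to raise the partition count so no single shuffle partition exceeds that size:

```scala
// Increase the number of shuffle partitions used by DataFrame/Dataset joins.
spark.conf.set("spark.sql.shuffle.partitions", "2000")   // value is workload-dependent

// Or explicitly repartition the inputs by the join key before joining.
val joined = dfA.repartition(2000, dfA("key"))
  .join(dfB.repartition(2000, dfB("key")), "key")
```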

Re: Does SparkR or SparkMLib support nonlinear optimization with non linear constraints

2016-11-25 Thread Robineast
I provided an answer to a similar question here: https://www.mail-archive.com/user@spark.apache.org/msg57697.html --- Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co.

Re: Any equivalent method lateral and explore

2016-11-25 Thread Jacek Laskowski
Hi, Interesting, but I personally would opt for withColumn since it'd be less to type (and also consistent with the tick (') syntax), as follows: df.withColumn("arrayItem", explode('myArray)) (Spark SQL has made my SQL developer's life so easy these days :)) Regards, Jacek Laskowski
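A compilable sketch of the two equivalent spellings (assuming a SparkSession `spark` and a DataFrame df with an array column myArray):

```scala
import org.apache.spark.sql.functions.explode
import spark.implicits._   // enables the ' (Symbol) column syntax

// One row per array element, keeping the other columns:
val withItem = df.withColumn("arrayItem", explode('myArray))

// Or only the exploded column, with an alias:
val itemsOnly = df.select(explode('myArray) as "arrayItem")
```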