[PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Judit Planas
Dear all, I am trying to store a NumPy array (loaded from an HDF5 dataset) into one cell of a DataFrame, but I am having problems. In short, my data layout is similar to a database, where I have a few columns with metadata (source of information, primary key, …

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas
Hi Amol, I am not sure I completely understand your question, but the SQL function "explode" may help you: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode Here you can find a nice example: https://stackoverflow.com/questions/38210507/explode-in-pyspark

Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Judit Planas
se(np.array([0, 1, 2])))]) On Wed, 28 Jun 2017 at 12:23 Judit Planas <judit.pla...@epfl.ch> wrote: Dear all, I am trying to store a NumPy

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas
…explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment"))) Ayan, the output of books_inexp.show was as below: title, author / Midnight Rain, Ralls, Kim / Maeve Ascendant, Co…
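The snippet above relies on the external spark-xml package (`com.databricks.spark.xml`, with `rowTag="book"` selecting the repeated element). As a package-free sketch of the same row expansion, the standard library can parse an XML blob and emit one tuple per `<book>` element; the second record below is illustrative data, not from the original thread.

```python
import xml.etree.ElementTree as ET

# An XML blob holding several <book> records (second entry is made up)
xml_blob = """<books>
  <book><title>Midnight Rain</title><author>Ralls, Kim</author></book>
  <book><title>Example Title</title><author>Doe, Jane</author></book>
</books>"""

# Mirror spark-xml's rowTag="book": each <book> element becomes one row
rows = [(b.findtext("title"), b.findtext("author"))
        for b in ET.fromstring(xml_blob).iter("book")]
```

In Spark, the same per-blob parsing logic could be wrapped in a UDF returning an array of structs and then flattened with `explode`, which is essentially what spark-xml does for you.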

[PySpark] Multiple driver cores

2017-08-02 Thread Judit Planas
Hello, I recently came across the "--driver-cores" option when, for example, launching a PySpark shell. Provided that there are idle CPUs on the driver's node, what would be the benefit of having multiple driver cores? For example, will this accelerate the …
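For context on the flag being asked about: per the Spark documentation, `--driver-cores` sets the number of cores used by the driver process in cluster deploy mode only; in client mode (the default for an interactive PySpark shell) the driver simply runs on the local machine. A sketch, with `my_app.py` as a hypothetical application:

```shell
# Request 4 cores for the driver process; honored only in cluster deploy mode
spark-submit --deploy-mode cluster --driver-cores 4 my_app.py
```

Extra driver cores mainly help when the driver itself does parallel work, e.g. task-result handling or heavy collect()/broadcast traffic; they do not speed up executor-side computation.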