Re: number of partitions in join: Spark documentation misleading!

2015-06-16 Thread Davies Liu
Please file a JIRA for it. On Mon, Jun 15, 2015 at 8:00 AM, mrm ma...@skimlinks.com wrote: Hi all, I was looking for an explanation on the number of partitions for a joined rdd. The documentation of Spark 1.3.1. says that: For distributed shuffle operations like reduceByKey and join, the

Re: BigDecimal problem in parquet file

2015-06-12 Thread Davies Liu
Maybe it's related to a bug, which is fixed by https://github.com/apache/spark/pull/6558 recently. On Fri, Jun 12, 2015 at 5:38 AM, Bipin Nag bipin@gmail.com wrote: Hi Cheng, Yes, some rows contain unit instead of decimal values. I believe some rows from original source I had don't have

Re: SparkSQL nested dictionaries

2015-06-08 Thread Davies Liu
I think it works in Python ``` df = sqlContext.createDataFrame([(1, {'a': 1})]) df.printSchema() root |-- _1: long (nullable = true) |-- _2: map (nullable = true) |    |-- key: string |    |-- value: long (valueContainsNull = true) df.select(df._2.getField('a')).show() +-+ |_2[a]|

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Davies Liu
crashes. On Tue, Jun 2, 2015 at 5:06 AM, Davies Liu dav...@databricks.com wrote: Could you run the single thread version in worker machine to make sure that OpenCV is installed and configured correctly? On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga sammiest...@gmail.com wrote: I've verified

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Davies Liu
, 2015 at 2:43 PM, Davies Liu dav...@databricks.com wrote: Please file a bug here: https://issues.apache.org/jira/browse/SPARK/ Could you also provide a way to reproduce this bug (including some datasets)? On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga sammiest...@gmail.com wrote: I've

Re: PySpark with OpenCV causes python worker to crash

2015-06-01 Thread Davies Liu
to crash the whole python executor. On Fri, May 29, 2015 at 2:06 AM, Davies Liu dav...@databricks.com wrote: Could you try to comment out some lines in `extract_sift_features_opencv` to find which line cause the crash? If the bytes came from sequenceFile() is broken, it's easy to crash a C

Re: Best strategy for Pandas - Spark

2015-06-01 Thread Davies Liu
The second one sounds reasonable, I think. On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, Let's assume I have a complex workflow of more than 10 datasources as input - 20 computations (some creating intermediary datasets and some merging

Re: deos randomSplit return a copy or a reference to the original rdd? [Python]

2015-06-01 Thread Davies Liu
No, all of the RDDs (including those returned from randomSplit()) are read-only. On Mon, Apr 27, 2015 at 11:28 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: Suppose I have something like the code below for idx in xrange(0, 10): train_test_split =

Re: Python implementation of RDD interface

2015-05-29 Thread Davies Liu
, May 29, 2015 at 2:46 PM Davies Liu dav...@databricks.com wrote: There is another implementation of RDD interface in Python, called DPark [1], Could you have a few words to compare these two? [1] https://github.com/douban/dpark/ On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss s...@svenkreiss.com

Re: Python implementation of RDD interface

2015-05-29 Thread Davies Liu
There is another implementation of RDD interface in Python, called DPark [1], Could you have a few words to compare these two? [1] https://github.com/douban/dpark/ On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss s...@svenkreiss.com wrote: I wanted to share a Python implementation of RDDs:

Re: PySpark with OpenCV causes python worker to crash

2015-05-28 Thread Davies Liu
Could you try to comment out some lines in `extract_sift_features_opencv` to find which line cause the crash? If the bytes came from sequenceFile() is broken, it's easy to crash a C library in Python (OpenCV). On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga sammiest...@gmail.com wrote: Hi

Re: PySpark Unknown Opcode Error

2015-05-26 Thread Davies Liu
This is probably the case that you run different versions of Python in the driver and slaves; Spark 1.4 will double-check that (it will be released soon). SPARK_PYTHON should be PYSPARK_PYTHON. On Tue, May 26, 2015 at 11:21 AM, Nikhil Muralidhar nmural...@gmail.com wrote: Hello, I am trying to run a

Re: Bigints in pyspark

2015-05-22 Thread Davies Liu
Could you show up the schema and confirm that they are LongType? df.printSchema() On Mon, Apr 27, 2015 at 5:44 AM, jamborta jambo...@gmail.com wrote: hi all, I have just come across a problem where I have a table that has a few bigint columns, it seems if I read that table into a dataframe

Re: [pyspark] Starting workers in a virtualenv

2015-05-21 Thread Davies Liu
Could you try with specify PYSPARK_PYTHON to the path of python in your virtual env, for example PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py On Mon, Apr 20, 2015 at 12:51 AM, Karlson ksonsp...@siberie.de wrote: Hi all, I am running the Python process that communicates with

Re: Is this a good use case for Spark?

2015-05-20 Thread Davies Liu
Spark is a great framework for doing things in parallel across multiple machines, and it will be really helpful for your case. Once you can wrap your entire pipeline into a single Python function: def process_document(path, text): # you can call other tools or services here return xxx then you can
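
A minimal sketch of that pattern, assuming the `sc` provided by a pyspark shell; `process_document` and the input paths are illustrative placeholders, not part of the original thread.

```python
def process_document(path):
    # call whatever external tools or services you need for one document;
    # this stub just returns the path and a dummy result
    return (path, len(path))

paths = ["/data/doc1.txt", "/data/doc2.txt", "/data/doc3.txt"]  # assumed inputs
results = sc.parallelize(paths, len(paths)).map(process_document).collect()
```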

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-20 Thread Davies Liu
AM, Davies Liu dav...@databricks.com wrote: SparkContext can be used in multiple threads (Spark streaming works with multiple threads), for example: import threading import time def show(x): time.sleep(1) print x def job(): sc.parallelize(range(100)).foreach(show

Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread Davies Liu
The docs had been updated. You should convert the DataFrame to RDD by `df.rdd` On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote: Hi Just upgraded to Spark 1.3.1. I am getting an warning Warning (from warnings module): File

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-20 Thread Davies Liu
W dniu 19.05.2015 o 23:56, Davies Liu pisze: It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ? On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes tomasz.frub...@fuw.edu.pl mailto:tomasz.frub...@fuw.edu.pl

Re: Does Python 2.7 have to be installed on every cluster node?

2015-05-19 Thread Davies Liu
PySpark works with CPython by default, and you can specify which version of Python to use by: PYSPARK_PYTHON=path/to/python bin/spark-submit xxx.py When you do the upgrade, you could install Python 2.7 on every machine in the cluster and test it with PYSPARK_PYTHON=python2.7 bin/spark-submit xxx.py
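
A quick check (a sketch, assuming the `sc` from a pyspark shell) that the workers actually picked up the interpreter set via PYSPARK_PYTHON:

```python
import sys

# ask every partition which interpreter it runs; after the upgrade each
# entry should report Python 2.7.x
versions = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
              .map(lambda _: sys.version)
              .distinct()
              .collect())
print(versions)
```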

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-19 Thread Davies Liu
It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ? On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes tomasz.frub...@fuw.edu.pl wrote: Dear Experts, we have a spark cluster (standalone mode) in which master and workers are started from root

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread Davies Liu
SparkContext can be used in multiple threads (Spark streaming works with multiple threads), for example: import threading import time def show(x): time.sleep(1) print x def job(): sc.parallelize(range(100)).foreach(show) threading.Thread(target=job).start() On Mon, May 18,
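
A cleaned-up version of the threading example quoted above, assuming the `sc` from a pyspark shell; each thread submits an independent job to the same SparkContext.

```python
import threading
import time

def show(x):
    time.sleep(1)
    print(x)

def job(n):
    # each call is a separate Spark job on the shared SparkContext
    sc.parallelize(range(n)).foreach(show)

threads = [threading.Thread(target=job, args=(100,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```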

Re: pass configuration parameters to PySpark job

2015-05-18 Thread Davies Liu
In PySpark, functions/closures are serialized together with the global values they use. For example: global_param = 111 def my_map(x): return x + global_param rdd.map(my_map) - Davies On Mon, May 18, 2015 at 7:26 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I am looking a way to
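
A small sketch of that closure-capture behaviour (assumes the `sc` from a pyspark shell): the global value is pickled together with the function and shipped to the workers.

```python
global_param = 111  # any driver-side configuration value

def my_map(x):
    return x + global_param  # captured and serialized with the closure

print(sc.parallelize([1, 2, 3]).map(my_map).collect())  # [112, 113, 114]
```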

Re: PySpark: slicing issue with dataframes

2015-05-17 Thread Davies Liu
Yes, it's a bug, please file a JIRA. On Sun, May 3, 2015 at 10:36 AM, Ali Bajwa ali.ba...@gmail.com wrote: Friendly reminder on this one. Just wanted to get a confirmation that this is not by design before I logged a JIRA Thanks! Ali On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa

Re: how to set random seed

2015-05-17 Thread Davies Liu
The Python workers used for each stage may be different, so this may not work as expected. You can create a Random object, set the seed, and use it to do the shuffle(): r = random.Random() r.seed(my_seed) def f(x): r.shuffle(x); return x rdd.map(f) On Thu, May 14, 2015 at 6:21 AM, Charles Hayden
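
A runnable sketch of that suggestion, assuming the `sc` from a pyspark shell; the seed value and data are illustrative. The seeded Random object is pickled with the closure, so the shuffles are reproducible as long as the data is partitioned the same way.

```python
import random

my_seed = 42
r = random.Random(my_seed)   # same as r = random.Random(); r.seed(my_seed)

def shuffled(x):
    items = list(x)          # copy so the shuffle does not mutate the input
    r.shuffle(items)
    return items

rdd = sc.parallelize([range(10)] * 4, 2)
print(rdd.map(shuffled).collect())
```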

Re: make two rdd co-partitioned in python

2015-04-09 Thread Davies Liu
In Spark 1.3+, PySpark also support this kind of narrow dependencies, for example, N = 10 a1 = a.partitionBy(N) b1 = b.partitionBy(N) then a1.union(b1) will only have N partitions. So, a1.join(b1) do not need shuffle anymore. On Thu, Apr 9, 2015 at 11:57 AM, pop xia...@adobe.com wrote: In
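
A sketch of that narrow-dependency behaviour (Spark 1.3+, `sc` from a pyspark shell); the data and N are illustrative.

```python
N = 10
a = sc.parallelize([(i, i * 2) for i in range(100)])
b = sc.parallelize([(i, i * 3) for i in range(100)])

a1 = a.partitionBy(N)
b1 = b.partitionBy(N)

print(a1.union(b1).getNumPartitions())  # N, not 2 * N
joined = a1.join(b1)                    # co-partitioned, so no extra shuffle
```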

Re: Issue with pyspark 1.3.0, sql package and rows

2015-04-08 Thread Davies Liu
I will look into this today. On Wed, Apr 8, 2015 at 7:35 AM, Stefano Parmesan parme...@spaziodati.eu wrote: Did anybody by any chance had a look at this bug? It keeps on happening to me, and it's quite blocking, I would like to understand if there's something wrong in what I'm doing, or

Re: Python Example sql.py not working in version spark-1.3.0-bin-hadoop2.4

2015-03-27 Thread Davies Liu
This will be fixed in https://github.com/apache/spark/pull/5230/files On Fri, Mar 27, 2015 at 9:13 AM, Peter Mac peter.machar...@noaa.gov wrote: I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run the following error occurs: [root@nde-dev8-template

Re: python : Out of memory: Kill process

2015-03-26 Thread Davies Liu
at 10:41 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Davies, I upgrade to 1.3.0 and still getting Out of Memory. I ran the same code as before, I need to make any changes? On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu dav...@databricks.com wrote: With batchSize = 1, I

Re: python : Out of memory: Kill process

2015-03-26 Thread Davies Liu
(spark.kryoserializer.buffer.mb,512)) sc = SparkContext(conf=conf ) sqlContext = SQLContext(sc) On Thu, Mar 26, 2015 at 2:29 PM, Davies Liu dav...@databricks.com wrote: Could you try to remove the line `log2.cache()` ? On Thu, Mar 26, 2015 at 10:02 AM, Eduardo Cusa eduardo.c

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
What's the version of Spark you are running? There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3, [1] https://issues.apache.org/jira/browse/SPARK-6055 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Guys, I running the following

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
batchsize parameter = 1 http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html if this does not work I will install 1.2.1 or 1.3 Regards On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu dav...@databricks.com wrote: What's the version of Spark you are running

Re: Spark 1.2. loses often all executors

2015-03-20 Thread Davies Liu
Maybe this is related to a bug in 1.2 [1]; it's fixed in 1.2.2 (not yet released). Could you check out the 1.2 branch and verify that? [1] https://issues.apache.org/jira/browse/SPARK-5788 On Fri, Mar 20, 2015 at 3:21 AM, mrm ma...@skimlinks.com wrote: Hi, I recently changed from Spark 1.1. to Spark

Re: Spark-submit and multiple files

2015-03-20 Thread Davies Liu
the error log. On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu dav...@databricks.com wrote: You could submit additional Python source via --py-files , for example: $ bin/spark-submit --py-files work.py main.py On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com wrote: Hello guys

Re: spark there is no space on the disk

2015-03-19 Thread Davies Liu
Is it possible that `spark.local.dir` is overriden by others? The docs say: NOTE: In Spark 1.0 and later this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi Sean, Thank very much for

Re: Error when using multiple python files spark-submit

2015-03-19 Thread Davies Liu
the options of spark-submit should come before main.py, or they will become the options of main.py, so it should be: ../hadoop/spark-install/bin/spark-submit --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py --master spark://spark-m:7077 main.py

Re: Spark 1.3 createDataframe error with pandas df

2015-03-19 Thread Davies Liu
On Mon, Mar 16, 2015 at 6:23 AM, kevindahl kevin.d...@gmail.com wrote: kevindahl wrote I'm trying to create a spark data frame from a pandas data frame, but for even the most trivial of datasets I get an error along the lines of this:

Re: Spark-submit and multiple files

2015-03-19 Thread Davies Liu
You could submit additional Python source via --py-files , for example: $ bin/spark-submit --py-files work.py main.py On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com wrote: Hello guys, I am having a hard time to understand how spark-submit behave with multiple files. I

Re: How to consider HTML files in Spark

2015-03-12 Thread Davies Liu
sc.wholeTextFiles() is what you need. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles On Thu, Mar 12, 2015 at 9:26 AM, yh18190 yh18...@gmail.com wrote: Hi. I am very much fascinated by the Spark framework. I am trying to use Pyspark + Beautifulsoup to
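
A sketch of the wholeTextFiles() + BeautifulSoup combination the question asks about; the input directory is an assumption, and bs4 must be installed on every worker.

```python
from bs4 import BeautifulSoup

def extract_text(path_and_html):
    path, html = path_and_html
    return (path, BeautifulSoup(html, "html.parser").get_text())

pages = sc.wholeTextFiles("hdfs:///data/html/")  # (path, whole-file content) pairs
texts = pages.map(extract_text)
```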

Re: Python script runs fine in local mode, errors in other modes

2015-03-08 Thread Davies Liu
?filter=-1 On Tue, Aug 19, 2014 at 12:12 PM, Davies Liu dav...@databricks.com wrote: This script run very well without your CSV file. Could download you CSV file into local disks, and narrow down to the lines which triggle this issue? On Tue, Aug 19, 2014 at 12:02 PM, Aaron aaron.doss

Re: Speed Benchmark

2015-02-27 Thread Davies Liu
@gmail.com wrote: It is a simple text file. I'm not using SQL. just doing a rdd.count() on it. Does the bug affect it? On Friday, February 27, 2015, Davies Liu dav...@databricks.com wrote: What is this dataset? text file or parquet file? There is an issue with serialization in Spark SQL, which

Re: Speed Benchmark

2015-02-27 Thread Davies Liu
What is this dataset? text file or parquet file? There is an issue with serialization in Spark SQL, which will make it very slow, see https://issues.apache.org/jira/browse/SPARK-6055, will be fixed very soon. Davies On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy guillaume.c@gmail.com wrote:

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Davies Liu
Another way to see the Python docs: $ export PYTHONPATH=$SPARK_HOME/python $ pydoc pyspark.sql On Tue, Feb 24, 2015 at 2:01 PM, Reynold Xin r...@databricks.com wrote: The official documentation will be posted when 1.3 is released (early March). Right now, you can build the docs yourself by

Re: Spark Performance on Yarn

2015-02-21 Thread Davies Liu
How many executors do you have per machine? It will be helpful if you could list all the configs. Could you also try to run it without persist? Caching can hurt rather than help if you don't have enough memory. On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote: Thanks for the

Re: stack map functions in a loop (pyspark)

2015-02-19 Thread Davies Liu
On Thu, Feb 19, 2015 at 7:57 AM, jamborta jambo...@gmail.com wrote: Hi all, I think I have run into an issue on the lazy evaluation of variables in pyspark, I have to following functions = [func1, func2, func3] for counter in range(len(functions)): data = data.map(lambda value:

Re: Spark can't pickle class: error cannot lookup attribute

2015-02-18 Thread Davies Liu
Currently, PySpark cannot pickle a class object defined in the current script ('__main__'). The workaround is to put the implementation of the class into a separate module, then use bin/spark-submit --py-files xxx.py to deploy it. In xxx.py: class test(object): def __init__(self, a, b):
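
A sketch of that workaround; `mylib.py` is a hypothetical module name, and `sc` comes from the pyspark shell or the main script.

```python
# mylib.py -- shipped with: bin/spark-submit --py-files mylib.py main.py
#
#     class Test(object):
#         def __init__(self, a, b):
#             self.a = a
#             self.b = b
#
# main.py (or a pyspark shell after sc.addPyFile("mylib.py")):
from mylib import Test  # importable on the workers because it was shipped

objs = sc.parallelize([(1, 2), (3, 4)]).map(lambda ab: Test(ab[0], ab[1]))
print(objs.map(lambda t: t.a + t.b).collect())  # [3, 7]
```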

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
. On Mon, Feb 16, 2015 at 4:04 PM, Davies Liu dav...@databricks.com wrote: It seems that the jar for cassandra is not loaded, you should have them in the classpath. On Mon, Feb 16, 2015 at 12:08 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Hello all, Trying the example code from

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
It seems that the jar for cassandra is not loaded, you should have them in the classpath. On Mon, Feb 16, 2015 at 12:08 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Hello all, Trying the example code from this package (https://github.com/Parsely/pyspark-cassandra) , I always get

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Davies Liu
For the last question, you can trigger GC in JVM from Python by : sc._jvm.System.gc() On Mon, Feb 16, 2015 at 4:08 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: thanks, that looks promissing but can't find any reference giving me more details - can you please point me to something? Also

Re: Shuffle on joining two RDDs

2015-02-16 Thread Davies Liu
? On 2015-02-12 19:27, Davies Liu wrote: The feature works as expected in Scala/Java, but not implemented in Python. On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid iras...@cloudera.com wrote: I wonder if the issue is that these lines just need to add preservesPartitioning = true ? https

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
a lot, Mohamed. On Mon, Feb 16, 2015 at 5:46 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Oh, I don't know. thanks a lot Davies, gonna figure that out now On Mon, Feb 16, 2015 at 5:31 PM, Davies Liu dav...@databricks.com wrote: It also need the Cassandra jar

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-13 Thread Davies Liu
large -- I've now split it up into many smaller operations but it's still not quite there -- see http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html Thanks, Rok On Wed, Feb 11, 2015, 19:59 Davies Liu dav...@databricks.com wrote: Could you share

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote: I was having trouble with memory exceptions when broadcasting a large lookup table, so I've resorted to processing it iteratively -- but how can I modify an RDD iteratively? I'm trying something like : rdd =

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
are you comparing? PySpark will try to combine the multiple map() together, then you will get a task which need all the lookup_tables (the same size as before). You could add a checkpoint after some of the iterations. On Feb 11, 2015, at 8:11 PM, Davies Liu dav...@databricks.com wrote: On Wed

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Davies Liu
Spark is a framework for doing things in parallel very easily, and it will definitely help your case. def read_file(path): lines = open(path).readlines() # bzip2 return lines filesRDD = sc.parallelize(path_to_files, N) lines = filesRDD.flatMap(read_file) Then you could do other transforms on
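
A cleaned-up sketch of that snippet, with the paths and partition count as assumptions; the files in the original question were bzip2-compressed, hence the bz2 module.

```python
import bz2

def read_file(path):
    with bz2.BZ2File(path) as f:
        return f.readlines()

path_to_files = ["/data/part-%04d.bz2" % i for i in range(1000)]  # assumed
N = 100  # how many tasks to spread the files over
lines = sc.parallelize(path_to_files, N).flatMap(read_file)
```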

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Davies Liu
-- but the dictionary is large, it's 8 Gb pickled on disk. On Feb 10, 2015, at 10:01 PM, Davies Liu dav...@databricks.com wrote: Could you paste the NPE stack trace here? It will better to create a JIRA for it, thanks! On Tue, Feb 10, 2015 at 10:42 AM, rok rokros...@gmail.com wrote: I'm trying

Re: Define size partitions

2015-01-30 Thread Davies Liu
I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case. [1] http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords Davies On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Hi, I want to process some

Re: Error when get data from hive table. Use python code.

2015-01-29 Thread Davies Liu
On Thu, Jan 29, 2015 at 6:36 PM, QiuxuanZhu ilsh1...@gmail.com wrote: Dear all, I have no idea when it raises an error when I run the following code. def getRow(data): return data.msg first_sql = select * from logs.event where dt = '20150120' and et = 'ppc' LIMIT 10#error

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Davies Liu
, Rok On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu dav...@databricks.com wrote: Maybe it's caused by integer overflow, is it possible that one object or batch bigger than 2G (after pickling)? On Tue, Jan 27, 2015 at 7:59 AM, rok rokros...@gmail.com wrote: I've got an dataset saved

Re: [documentation] Update the python example ALS of the site?

2015-01-27 Thread Davies Liu
will be fixed by https://github.com/apache/spark/pull/4226 On Tue, Jan 27, 2015 at 8:17 AM, gen tang gen.tan...@gmail.com wrote: Hi, In the spark 1.2.0, it requires the ratings should be a RDD of Rating or tuple or list. However, the current example in the site use still RDD[array] as the

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread Davies Liu
Maybe it's caused by integer overflow, is it possible that one object or batch bigger than 2G (after pickling)? On Tue, Jan 27, 2015 at 7:59 AM, rok rokros...@gmail.com wrote: I've got an dataset saved with saveAsPickleFile using pyspark -- it saves without problems. When I try to read it back

Re: Large number of pyspark.daemon processes

2015-01-23 Thread Davies Liu
It should be a bug, the Python worker did not exit normally, could you file a JIRA for this? Also, could you show how to reproduce this behavior? On Fri, Jan 23, 2015 at 11:45 PM, Sven Krasser kras...@gmail.com wrote: Hey Adam, I'm not sure I understand just yet what you have in mind. My

Re: Using third party libraries in pyspark

2015-01-22 Thread Davies Liu
You need to install these libraries on all the slaves, or submit via spark-submit: spark-submit --py-files xxx On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, I might be asking something very trivial, but whats the recommend way of using third party libraries.

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
We have not seen this issue, so we are not sure whether there are bugs related to the reused worker or not. Could you provide more details about it? On Wed, Jan 21, 2015 at 2:27 AM, critikaled isasmani@gmail.com wrote: I'm also facing the same issue. is this a bug? -- View this message in context:

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Davies Liu
among all the tasks within the same executor. 2015-01-21 15:04 GMT+08:00 Davies Liu dav...@databricks.com: Maybe some change related to serialize the closure cause LogParser is not a singleton any more, then it is initialized for every task. Could you change it to a Broadcast? On Tue, Jan 20

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
for it, thanks! On Wed, Jan 21, 2015 at 4:56 PM, Tassilo Klein tjkl...@gmail.com wrote: I set spark.python.worker.reuse = false and now it seems to run longer than before (it has not crashed yet). However, it is very very slow. How to proceed? On Wed, Jan 21, 2015 at 2:21 AM, Davies Liu dav

Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Davies Liu
If the dataset is not huge (a few GB), you can set up NFS instead of HDFS (which is much harder to set up): 1. export a directory on the master (or any node in the cluster) 2. mount it at the same path across all slaves 3. read/write from it by file:///path/to/mountpoint On Tue, Jan 20, 2015 at
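
Once the export is mounted at the same path on every node (steps 1 and 2 above), reading and writing it from PySpark is just a matter of file:// URLs; the mount point below is an example.

```python
rdd = sc.textFile("file:///mnt/shared/input/")        # readable on all nodes
rdd.saveAsTextFile("file:///mnt/shared/output/run1")  # written via the mount
```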

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
Could you provide a short script to reproduce this issue? On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein tjkl...@gmail.com wrote: Hi, I just recently tried to migrate from Spark 1.1 to Spark 1.2 - using PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster than Spark 1.1.

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
. The theano function itself is a broadcast variable. Let me know if you need more information. Best, Tassilo On Wed, Jan 21, 2015 at 1:17 AM, Davies Liu dav...@databricks.com wrote: Could you provide a short script to reproduce this issue? On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein tjkl

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Davies Liu
Maybe some change related to serializing the closure caused LogParser to no longer be a singleton, so it is initialized for every task. Could you change it to a Broadcast? On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO raofeng...@gmail.com wrote: Currently we are migrating from spark 1.1 to spark

Re: Scala vs Python performance differences

2015-01-16 Thread Davies Liu
Hey Phil, Thank you for sharing this. The result didn't surprise me much; it's normal to prototype in Python, and once it gets stable and you really need the performance, rewriting part of it in C or all of it in another language does make sense, and it will not cost you much time. Davies On

Re: Processing .wav files in PySpark

2015-01-16 Thread Davies Liu
I think you cannot use textFile(), binaryFiles(), or pickleFile() here; wav is a different format. You could get a list of paths for all the files, then sc.parallelize() and foreach(): def process(path): # use subprocess to launch a process to do the job, read the stdout as result
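
A sketch of that subprocess-per-file approach; `sox` and the input paths are assumptions — any command-line audio tool installed on the workers would do.

```python
import subprocess

def process(path):
    # launch an external tool for one .wav file and capture its output
    out = subprocess.check_output(["sox", path, "-n", "stat"],
                                  stderr=subprocess.STDOUT)
    return (path, out)

paths = ["/data/audio/%03d.wav" % i for i in range(100)]  # assumed inputs
results = sc.parallelize(paths, 20).map(process).collect()
```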

Re: spark crashes on second or third call first() on file

2015-01-15 Thread Davies Liu
What's the version of Spark you are using? On Wed, Jan 14, 2015 at 12:00 AM, Linda Terlouw linda.terl...@icris.nl wrote: I'm new to Spark. When I use the Movie Lens dataset 100k (http://grouplens.org/datasets/movielens/), Spark crashes when I run the following code. The first call to

Re: save spark streaming output to single file on hdfs

2015-01-13 Thread Davies Liu
On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote: Hi all, Is there a way to save dstream RDDs to a single file so that another process can pick it up as a single RDD? It does not need to be a single file; Spark can read a whole directory as a single RDD. Also, it's easy to union

Re: save spark streaming output to single file on hdfs

2015-01-13 Thread Davies Liu
13 2015 at 18:15:15 Davies Liu dav...@databricks.com wrote: On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote: Hi all, Is there a way to save dstream RDDs to a single file so that another process can pick it up as a single RDD? It does not need to a single file

Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Davies Liu
In the current implementation of TorrentBroadcast, the blocks are fetched one-by-one in single thread, so it can not fully utilize the network bandwidth. Davies On Fri, Jan 9, 2015 at 2:11 AM, Jun Yang yangjun...@gmail.com wrote: Guys, I have a question regarding to Spark 1.1 broadcast

Re: Shuffle Problems in 1.2.0

2015-01-07 Thread Davies Liu
:29 PM, Davies Liu dav...@databricks.com wrote: I still can not reproduce it with 2 nodes (4 CPUs). Your repro.py could be faster (10 min) than before (22 min): inpdata.map(lambda (pc, x): (x, pc=='p' and 2 or 1)).reduceByKey(lambda x, y: x|y).filter(lambda (x, pc): pc==3).collect() (also

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I had ran your scripts in 5 nodes ( 2 CPUs, 8G mem) cluster, can not reproduce your failure. Should I test it with big memory node? On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote: Thanks for the input! I've managed to come up with a repro of the error with test data only

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
at 12:46 AM, Davies Liu dav...@databricks.com wrote: I had ran your scripts in 5 nodes ( 2 CPUs, 8G mem) cluster, can not reproduce your failure. Should I test it with big memory node? On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote: Thanks for the input! I've managed

Re: python: module pyspark.daemon not found

2014-12-30 Thread Davies Liu
Could you share a link about this? It's common to use Java 7, that will be nice if we can fix this. On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Was your spark assembly jarred with Java 7? There's a known issue with jar files made with that version. It

Re: Python:Streaming Question

2014-12-30 Thread Davies Liu
There is a known bug with local scheduler, will be fixed by https://github.com/apache/spark/pull/3779 On Sun, Dec 21, 2014 at 10:57 PM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: I’m trying to run the stateful network word count at

Re: spark streaming python + kafka

2014-12-22 Thread Davies Liu
There is a WIP pull request[1] working on this, it should be merged into master soon. [1] https://github.com/apache/spark/pull/3715 On Fri, Dec 19, 2014 at 2:15 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I've just seen that streaming spark supports python from 1.2 version.

Re: Error when Applying schema to a dictionary with a Tuple as key

2014-12-16 Thread Davies Liu
It's a bug, could you file a JIRA for this? thanks! On Tue, Dec 16, 2014 at 5:49 AM, sahanbull sa...@skimlinks.com wrote: Hi Guys, Im running a spark cluster in AWS with Spark 1.1.0 in EC2 I am trying to convert a an RDD with tuple (u'string', int , {(int, int): int, (int, int): int})

Re: Error when Applying schema to a dictionary with a Tuple as key

2014-12-16 Thread Davies Liu
I had created https://issues.apache.org/jira/browse/SPARK-4866, it will be fixed by https://github.com/apache/spark/pull/3714. Thank you for reporting this. Davies On Tue, Dec 16, 2014 at 12:44 PM, Davies Liu dav...@databricks.com wrote: It's a bug, could you file a JIRA for this? thanks

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-15 Thread Davies Liu
Any task could be launched concurrently on different nodes, so in order to make sure the generated files are valid, you need some atomic operation (such as rename). For example, you could generate a random name for the output file, write the data into it, and rename it to
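
A sketch of that write-then-rename pattern inside a Spark job, assuming an existing RDD `rdd` of printable records; the output directory is an assumption and must be on a filesystem where rename is atomic (e.g. the same local or network mount).

```python
import os
import uuid

def write_partition(index, records):
    final = "/mnt/shared/output/part-%05d" % index
    tmp = "%s.%s.tmp" % (final, uuid.uuid4())  # unique name per task attempt
    with open(tmp, "w") as f:
        for r in records:
            f.write("%s\n" % r)
    os.rename(tmp, final)  # atomic, so a duplicate attempt cannot leave a
                           # half-written final file behind
    return []

rdd.mapPartitionsWithIndex(write_partition).count()  # count() forces the writes
```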

Re: PySprak and UnsupportedOperationException

2014-12-09 Thread Davies Liu
On Tue, Dec 9, 2014 at 11:32 AM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: While trying simple examples of PySpark code, I systematically get these failures when I try this.. I dont see any prior exceptions in the output... How can I debug further to find root cause? es_rdd =

Re: Error when mapping a schema RDD when converting lists

2014-12-08 Thread Davies Liu
This is fixed in 1.2. Also, in 1.2+ you could call row.asDict() to convert the Row object into dict. On Mon, Dec 8, 2014 at 6:38 AM, sahanbull sa...@skimlinks.com wrote: Hi Guys, I used applySchema to store a set of nested dictionaries and lists in a parquet file.

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Davies Liu
Could you post you script to reproduce the results (also how to generate the dataset)? That will help us to investigate it. On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, here I use spark on local mode on my laptop with 8 cores. The data is on my local

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-04 Thread Davies Liu
Which version of Spark are you using? inferSchema() is improved to support empty dict in 1.2+, could you try the 1.2-RC1? Also, you can use applySchema(): from pyspark.sql import * fields = [StructField('field1', IntegerType(), True), StructField('field2', StringType(), True),
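
A hedged completion of that applySchema() pattern (Spark 1.2-era API, where the types live directly in pyspark.sql, and `sc` comes from a pyspark shell); the field names and data are illustrative.

```python
from pyspark.sql import (SQLContext, StructType, StructField,
                         IntegerType, StringType, MapType)

sqlContext = SQLContext(sc)
fields = [StructField('field1', IntegerType(), True),
          StructField('field2', StringType(), True),
          StructField('field3', MapType(StringType(), IntegerType()), True)]
schema = StructType(fields)

rdd = sc.parallelize([(5, 'string', {'a': 1, 'c': 2})])
srdd = sqlContext.applySchema(rdd, schema)
print(srdd.first())
```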

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-03 Thread Davies Liu
inferSchema() will work better than jsonRDD() in your case: from pyspark.sql import Row srdd = sqlContext.inferSchema(rdd.map(lambda x: Row(**x))) srdd.first() Row(field1=5, field2='string', field3={'a': 1, 'c': 2}) On Wed, Dec 3, 2014 at 12:11 AM, sahanbull sa...@skimlinks.com wrote: Hi

Re: cannot submit python files on EC2 cluster

2014-12-03 Thread Davies Liu
On Wed, Dec 3, 2014 at 8:17 PM, chocjy jiyanyan...@gmail.com wrote: Hi, I am using spark with version number 1.1.0 on an EC2 cluster. After I submitted the job, it returned an error saying that a python module cannot be loaded due to missing files. I am using the same command that used to

Re: numpy arrays and spark sql

2014-12-01 Thread Davies Liu
applySchema() only accepts an RDD of Row/list/tuple; it does not work with numpy.array. After applySchema(), the Python RDD will be pickled and unpickled in the JVM, so you will not get any benefit from using numpy.array. It will work if you convert the ndarray into a list: schemaRDD =
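
A sketch of that ndarray-to-list conversion (Spark 1.x API, `sc` from a pyspark shell); the schema and data are assumptions.

```python
import numpy as np
from pyspark.sql import SQLContext, StructType, StructField, DoubleType

sqlContext = SQLContext(sc)
arrays = sc.parallelize([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
schema = StructType([StructField('x', DoubleType(), False),
                     StructField('y', DoubleType(), False)])
# tolist() turns each ndarray into a plain Python list that applySchema accepts
schemaRDD = sqlContext.applySchema(arrays.map(lambda a: a.tolist()), schema)
```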

Re: Python Scientific Libraries in Spark

2014-11-24 Thread Davies Liu
These libraries could be used in PySpark easily. For example, MLlib uses Numpy heavily, it can accept np.array or sparse matrix in SciPy as vectors. On Mon, Nov 24, 2014 at 10:56 AM, Rohit Pujari rpuj...@hortonworks.com wrote: Hello Folks: Since spark exposes python bindings and allows you to

Re: Pyspark Error

2014-11-18 Thread Davies Liu
It seems that `localhost` can not be resolved in your machines, I had filed https://issues.apache.org/jira/browse/SPARK-4475 to track it. On Tue, Nov 18, 2014 at 6:10 AM, amin mohebbi aminn_...@yahoo.com.invalid wrote: Hi there, I have already downloaded Pre-built spark-1.1.0, I want to run

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote: Use zipWithIndex but cache the data before you run zipWithIndex...that way your ordering will be consistent (unless the bug has been fixed where you don't have to cache the data)... Could you point some link about

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
I see, thanks! On Tue, Nov 18, 2014 at 12:12 PM, Sean Owen so...@cloudera.com wrote: On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote: On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote: Use zipWithIndex but cache the data before you run

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Davies Liu
rdd1.union(rdd2).groupByKey() On Thu, Nov 13, 2014 at 3:41 AM, Blind Faith person.of.b...@gmail.com wrote: Let us say I have the following two RDDs, with the following key-pair values. rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ] and rdd2 = [ (key1, [value5,
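
Spelled out with the RDDs from the question (a sketch, `sc` from a pyspark shell; rdd2 is completed with example values):

```python
rdd1 = sc.parallelize([("key1", ["value1", "value2"]),
                       ("key2", ["value3", "value4"])])
rdd2 = sc.parallelize([("key1", ["value5", "value6"])])

combined = rdd1.union(rdd2).groupByKey().mapValues(list)
# key1 -> [["value1", "value2"], ["value5", "value6"]]
# key2 -> [["value3", "value4"]]
print(combined.collect())
```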

Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-13 Thread Davies Liu
worker nodes with a total of about 80 cores. Thanks again for the tips! On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] [hidden email] wrote: Could you tell how large is the data set? It will help us to debug this issue. On Thu, Nov 6, 2014 at 10:39 AM, skane [hidden

Re: pyspark and hdfs file name

2014-11-13 Thread Davies Liu
One option may be to call HDFS tools or a client to rename them after saveAsXXXFile(). On Thu, Nov 13, 2014 at 9:39 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I am running pyspark job. I need serialize final result to hdfs in binary files and having ability to give a name for output

Re: Using a compression codec in saveAsSequenceFile in Pyspark (Python API)

2014-11-13 Thread Davies Liu
You could use the following as compressionCodecClass: DEFLATE: org.apache.hadoop.io.compress.DefaultCodec; gzip: org.apache.hadoop.io.compress.GzipCodec; bzip2: org.apache.hadoop.io.compress.BZip2Codec; LZO: com.hadoop.compression.lzo.LzopCodec. For gzip,
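
An example (a sketch, assuming the `sc` from a pyspark shell) of passing one of those codec classes to saveAsSequenceFile(); the output path is illustrative.

```python
pairs = sc.parallelize([("key-%d" % i, i) for i in range(100)])
pairs.saveAsSequenceFile(
    "hdfs:///tmp/seq-gzip",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
```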

Re: Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread Davies Liu
This is a bug, will be fixed by https://github.com/apache/spark/pull/3230 On Wed, Nov 12, 2014 at 7:20 AM, rprabhu rpra...@ufl.edu wrote: Hello, I'm trying to run a classification task using mllib decision trees. After successfully training the model, I was trying to test the model using some
