Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
read back the original data. Will try converting the str to bytearray before storing it in a sequencefile. Thanks, Sam Stoelinga

Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) On Tue, Jun 9, 2015 at 11:04 AM, Sam Stoelinga sammiest...@gmail.com wrote: Hi all, I'm storing an rdd as sequencefile with the following content: key=filename(string) value=python str from numpy.savez(not unicode

Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
language usable SequenceFile instead of using Picklefile though, so if anybody has pointers would appreciate that :) On Tue, Jun 9, 2015 at 11:35 AM, Sam Stoelinga sammiest...@gmail.com wrote: Update: Using bytearray before storing to RDD is not a solution either. This happens when trying to read

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
. On Fri, Jun 5, 2015 at 2:17 PM, Sam Stoelinga sammiest...@gmail.com wrote: Yea should have emphasized that. I'm running the same code on the same VM. It's a VM with spark in standalone mode and I run the unit test directly on that same VM. So OpenCV is working correctly on that same machine

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
2, 2015 at 5:06 AM, Davies Liu dav...@databricks.com wrote: Could you run the single thread version in worker machine to make sure that OpenCV is installed and configured correctly? On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga sammiest...@gmail.com wrote: I've verified the issue lies

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
: Please file a bug here: https://issues.apache.org/jira/browse/SPARK/ Could you also provide a way to reproduce this bug (including some datasets)? On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga sammiest...@gmail.com wrote: I've changed the SIFT feature extraction to SURF feature extraction

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
Please ignore this whole thread. It started working out of nowhere. I'm not sure what the root cause was. After I restarted the VM, the previous SIFT code also started working. On Fri, Jun 5, 2015 at 10:40 PM, Sam Stoelinga sammiest...@gmail.com wrote: Thanks Davies. I will file a bug later with code

Re: PySpark with OpenCV causes python worker to crash

2015-05-30 Thread Sam Stoelinga
? If the bytes that came from sequenceFile() are broken, it's easy to crash a C library in Python (OpenCV). On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga sammiest...@gmail.com wrote: Hi sparkers, I am working on a PySpark application which uses the OpenCV library. It runs fine when running

Re: PySpark with OpenCV causes python worker to crash

2015-05-30 Thread Sam Stoelinga
.COLOR_BGR2GRAY) sift = cv2.xfeatures2d.SIFT_create() kp, descriptors = sift.detectAndCompute(gray, None) return (imgfilename, test) And corresponding tests.py: https://gist.github.com/samos123/d383c26f6d47d34d32d6 On Sat, May 30, 2015 at 8:04 PM, Sam Stoelinga sammiest...@gmail.com wrote

PySpark with OpenCV causes python worker to crash

2015-05-28 Thread Sam Stoelinga
This is the error message taken from STDERR of the worker log: https://gist.github.com/samos123/3300191684aee7fc8013 I would like pointers or tips on how to debug further. It would be nice to know why the worker crashed. Thanks, Sam Stoelinga org.apache.spark.SparkException: Python worker exited

MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
. Looking forward to hearing you point out my stupidity or provide workarounds that could make Spark KMeans work well on large datasets. Regards, Sam Stoelinga

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
PM, Jeetendra Gangele gangele...@gmail.com wrote: How are you passing the feature vector to KMeans? Is it in 2-D space of 1-D array? Did you try using Streaming KMeans? Will you be able to paste the code here? On 29 April 2015 at 17:23, Sam Stoelinga sammiest...@gmail.com wrote: Hi Sparkers, I

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Guys, great feedback pointing out my stupidity :D Rows and columns got intermixed, hence the weird results I was seeing. Ignore my previous issues; I will reformat my data first. On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga sammiest...@gmail.com wrote: I'm mostly using example code, see here