Please file a JIRA for it.
On Mon, Jun 15, 2015 at 8:00 AM, mrm ma...@skimlinks.com wrote:
Hi all,
I was looking for an explanation of the number of partitions for a joined
RDD.
The documentation of Spark 1.3.1 says that:
For distributed shuffle operations like reduceByKey and join, the
Maybe it's related to a bug, which is fixed by
https://github.com/apache/spark/pull/6558 recently.
On Fri, Jun 12, 2015 at 5:38 AM, Bipin Nag bipin@gmail.com wrote:
Hi Cheng,
Yes, some rows contain unit instead of decimal values. I believe some rows
from original source I had don't have
I think it works in Python:
```
df = sqlContext.createDataFrame([(1, {'a': 1})])
df.printSchema()
root
|-- _1: long (nullable = true)
|-- _2: map (nullable = true)
|    |-- key: string
|    |-- value: long (valueContainsNull = true)
df.select(df._2.getField('a')).show()
+-----+
|_2[a]|
```
crashes.
On Tue, Jun 2, 2015 at 5:06 AM, Davies Liu dav...@databricks.com wrote:
Could you run the single-threaded version on a worker machine to make sure
that OpenCV is installed and configured correctly?
On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga sammiest...@gmail.com
wrote:
I've verified
, 2015 at 2:43 PM, Davies Liu dav...@databricks.com wrote:
Please file a bug here: https://issues.apache.org/jira/browse/SPARK/
Could you also provide a way to reproduce this bug (including some
datasets)?
On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga sammiest...@gmail.com
wrote:
I've
to crash the whole python
executor.
On Fri, May 29, 2015 at 2:06 AM, Davies Liu dav...@databricks.com wrote:
Could you try to comment out some lines in
`extract_sift_features_opencv` to find which line causes the crash?
If the bytes that come from sequenceFile() are broken, it's easy to crash a
C
The second one sounds reasonable, I think.
On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Hi everyone,
Let's assume I have a complex workflow of more than 10 datasources as input
- 20 computations (some creating intermediary datasets and some merging
No, all of the RDDs (including those returned from randomSplit()) are read-only.
On Mon, Apr 27, 2015 at 11:28 AM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
Suppose I have something like the code below
for idx in xrange(0, 10):
train_test_split =
, May 29, 2015 at 2:46 PM Davies Liu dav...@databricks.com wrote:
There is another implementation of the RDD interface in Python, called
DPark [1]. Could you say a few words comparing the two?
[1] https://github.com/douban/dpark/
On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss s...@svenkreiss.com
There is another implementation of the RDD interface in Python, called
DPark [1]. Could you say a few words comparing the two?
[1] https://github.com/douban/dpark/
On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss s...@svenkreiss.com wrote:
I wanted to share a Python implementation of RDDs:
Could you try to comment out some lines in
`extract_sift_features_opencv` to find which line causes the crash?
If the bytes that come from sequenceFile() are broken, it's easy to crash a
C library in Python (OpenCV).
On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga sammiest...@gmail.com wrote:
Hi
This is likely because you are running different versions of Python on the
driver and the slaves; Spark 1.4 (which will be released soon) will
double-check this.
SPARK_PYTHON should be PYSPARK_PYTHON
On Tue, May 26, 2015 at 11:21 AM, Nikhil Muralidhar nmural...@gmail.com wrote:
Hello,
I am trying to run a
Could you show the schema and confirm that they are LongType?
df.printSchema()
On Mon, Apr 27, 2015 at 5:44 AM, jamborta jambo...@gmail.com wrote:
hi all,
I have just come across a problem where I have a table that has a few bigint
columns, it seems if I read that table into a dataframe
Could you try specifying PYSPARK_PYTHON as the path to the Python in
your virtual env, for example
PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py
On Mon, Apr 20, 2015 at 12:51 AM, Karlson ksonsp...@siberie.de wrote:
Hi all,
I am running the Python process that communicates with
Spark is a great framework for doing things in parallel across multiple machines;
it will be really helpful for your case.
Once you can wrap your entire pipeline into a single Python function:
def process_document(path, text):
# you can call other tools or services here
return xxx
then you can
AM, Davies Liu dav...@databricks.com wrote:
SparkContext can be used in multiple threads (Spark streaming works
with multiple threads), for example:
import threading
import time
def show(x):
    time.sleep(1)
    print x

def job():
    sc.parallelize(range(100)).foreach(show)

threading.Thread(target=job).start()
The docs have been updated.
You should convert the DataFrame to an RDD via `df.rdd`.
On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote:
Hi
Just upgraded to Spark 1.3.1.
I am getting a warning
Warning (from warnings module):
File
On 19.05.2015 at 23:56, Davies Liu wrote:
That surprises me. Could you list the owner information of
/mnt/lustre/bigdata/med_home/tmp/test19EE/ ?
On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes
tomasz.frub...@fuw.edu.pl
PySpark work with CPython by default, and you can specify which
version of Python to use by:
PYSPARK_PYTHON=path/to/python bin/spark-submit xxx.py
When you do the upgrade, you could install python 2.7 on every machine
in the cluster, test it with
PYSPARK_PYTHON=python2.7 bin/spark-submit xxx.py
That surprises me. Could you list the owner information of
/mnt/lustre/bigdata/med_home/tmp/test19EE/ ?
On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes
tomasz.frub...@fuw.edu.pl wrote:
Dear Experts,
we have a spark cluster (standalone mode) in which master and workers are
started from root
SparkContext can be used in multiple threads (Spark streaming works
with multiple threads), for example:
import threading
import time
def show(x):
    time.sleep(1)
    print x

def job():
    sc.parallelize(range(100)).foreach(show)

threading.Thread(target=job).start()
On Mon, May 18,
In PySpark, functions/closures are serialized together with the global
values they use.
For example,
global_param = 111

def my_map(x):
    return x + global_param

rdd.map(my_map)
- Davies
On Mon, May 18, 2015 at 7:26 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi ,
I am looking a way to
Yes, it's a bug, please file a JIRA.
On Sun, May 3, 2015 at 10:36 AM, Ali Bajwa ali.ba...@gmail.com wrote:
Friendly reminder on this one. Just wanted to get a confirmation that this
is not by design before I logged a JIRA
Thanks!
Ali
On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa
The Python workers used for each stage may be different, so this may not
work as expected.
You can create a Random object, set the seed, use it to do the shuffle().
r = random.Random()
r.seed(my_seed)

def f(x):
    r.shuffle(l)

rdd.map(f)
On Thu, May 14, 2015 at 6:21 AM, Charles Hayden
In Spark 1.3+, PySpark also supports this kind of narrow dependency,
for example,
N = 10
a1 = a.partitionBy(N)
b1 = b.partitionBy(N)
then a1.union(b1) will only have N partitions.
So, a1.join(b1) does not need a shuffle anymore.
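A slightly fuller sketch of this, with toy key-value RDDs (the data and names are only illustrative):
```
# toy key-value RDDs; any (key, value) RDDs would do
a = sc.parallelize([(i, i * 2) for i in range(100)])
b = sc.parallelize([(i, i * 3) for i in range(100)])

N = 10
a1 = a.partitionBy(N)
b1 = b.partitionBy(N)

print(a1.union(b1).getNumPartitions())  # N, not 2 * N
joined = a1.join(b1)  # co-partitioned inputs, so the join avoids a shuffle
```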
On Thu, Apr 9, 2015 at 11:57 AM, pop xia...@adobe.com wrote:
In
I will look into this today.
On Wed, Apr 8, 2015 at 7:35 AM, Stefano Parmesan parme...@spaziodati.eu wrote:
Did anybody by any chance have a look at this bug? It keeps on happening to
me, and it's quite blocking; I would like to understand if there's something
wrong in what I'm doing, or
This will be fixed in https://github.com/apache/spark/pull/5230/files
On Fri, Mar 27, 2015 at 9:13 AM, Peter Mac peter.machar...@noaa.gov wrote:
I downloaded spark version spark-1.3.0-bin-hadoop2.4.
When the python version of sql.py is run the following error occurs:
[root@nde-dev8-template
at 10:41 AM, Eduardo Cusa
eduardo.c...@usmediaconsulting.com wrote:
Hi Davies, I upgraded to 1.3.0 and am still getting Out of Memory.
I ran the same code as before; do I need to make any changes?
On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu dav...@databricks.com wrote:
With batchSize = 1, I
("spark.kryoserializer.buffer.mb", 512))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
On Thu, Mar 26, 2015 at 2:29 PM, Davies Liu dav...@databricks.com wrote:
Could you try to remove the line `log2.cache()` ?
On Thu, Mar 26, 2015 at 10:02 AM, Eduardo Cusa
eduardo.c
What's the version of Spark you are running?
There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3.
[1] https://issues.apache.org/jira/browse/SPARK-6055
On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
eduardo.c...@usmediaconsulting.com wrote:
Hi Guys, I am running the following
batchsize parameter = 1
http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
if this does not work I will install 1.2.1 or 1.3
Regards
On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu dav...@databricks.com wrote:
What's the version of Spark you are running
Maybe this is related to a bug in 1.2 [1]; it's fixed in 1.2.2 (not yet
released). Could you check out the 1.2 branch and verify that?
[1] https://issues.apache.org/jira/browse/SPARK-5788
On Fri, Mar 20, 2015 at 3:21 AM, mrm ma...@skimlinks.com wrote:
Hi,
I recently changed from Spark 1.1 to Spark
the error log.
On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu dav...@databricks.com wrote:
You could submit additional Python source via --py-files , for example:
$ bin/spark-submit --py-files work.py main.py
On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com
wrote:
Hello guys
Is it possible that `spark.local.dir` is overridden by something else? The docs say:
NOTE: In Spark 1.0 and later this will be overridden by
SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN)
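For example, a minimal sketch of setting it through the environment on a standalone cluster (the path is hypothetical):
```
# in conf/spark-env.sh on each worker; this overrides spark.local.dir
export SPARK_LOCAL_DIRS=/mnt/spark-local
```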
On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia sparkpeng...@gmail.com wrote:
Hi Sean,
Thank very much for
the options of spark-submit should come before main.py, or they will
become the options of main.py, so it should be:
../hadoop/spark-install/bin/spark-submit --py-files
/home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py
--master spark://spark-m:7077 main.py
On Mon, Mar 16, 2015 at 6:23 AM, kevindahl kevin.d...@gmail.com wrote:
kevindahl wrote
I'm trying to create a spark data frame from a pandas data frame, but for
even the most trivial of datasets I get an error along the lines of this:
You could submit additional Python source via --py-files , for example:
$ bin/spark-submit --py-files work.py main.py
On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com wrote:
Hello guys,
I am having a hard time understanding how spark-submit behaves with multiple
files. I
sc.wholeTextFile() is what you need.
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles
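A minimal sketch of using it with BeautifulSoup (assuming bs4 is installed on the workers; the directory path and the title extraction are only illustrative):
```
from bs4 import BeautifulSoup

def extract_title(html):
    # parse one document and pull out the <title> text
    return BeautifulSoup(html).title.string

# RDD of (filename, whole file content) pairs
pages = sc.wholeTextFiles("file:///data/html_pages")
titles = pages.mapValues(extract_title)
```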
On Thu, Mar 12, 2015 at 9:26 AM, yh18190 yh18...@gmail.com wrote:
Hi. I am very much fascinated by the Spark framework. I am trying to use PySpark +
Beautifulsoup to
On Tue, Aug 19, 2014 at 12:12 PM, Davies Liu dav...@databricks.com wrote:
This script runs very well without your CSV file. Could you download your
CSV file to local disk, and narrow down to the lines which trigger
this issue?
On Tue, Aug 19, 2014 at 12:02 PM, Aaron aaron.doss@gmail.com wrote:
It is a simple text file.
I'm not using SQL. just doing a rdd.count() on it. Does the bug affect it?
On Friday, February 27, 2015, Davies Liu dav...@databricks.com wrote:
What is this dataset? text file or parquet file?
There is an issue with serialization in Spark SQL, which
What is this dataset? text file or parquet file?
There is an issue with serialization in Spark SQL, which will make it
very slow; see https://issues.apache.org/jira/browse/SPARK-6055. It will
be fixed very soon.
Davies
On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
guillaume.c@gmail.com wrote:
Another way to see the Python docs:
$ export PYTHONPATH=$SPARK_HOME/python
$ pydoc pyspark.sql
On Tue, Feb 24, 2015 at 2:01 PM, Reynold Xin r...@databricks.com wrote:
The official documentation will be posted when 1.3 is released (early
March).
Right now, you can build the docs yourself by
How many executors do you have per machine? It would be helpful if you
could list all the configs.
Could you also try to run it without persist? Caching can hurt more than
help if you don't have enough memory.
On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote:
Thanks for the
On Thu, Feb 19, 2015 at 7:57 AM, jamborta jambo...@gmail.com wrote:
Hi all,
I think I have run into an issue with the lazy evaluation of variables in
pyspark; I have the following
functions = [func1, func2, func3]
for counter in range(len(functions)):
    data = data.map(lambda value:
Currently, PySpark cannot pickle a class object defined in the current
script ('__main__'). The workaround is to put the implementation
of the class into a separate module, then use bin/spark-submit
--py-files xxx.py to deploy it.
in xxx.py:
class test(object):
    def __init__(self, a, b):
        ...
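A minimal self-contained sketch of this workaround (the attribute assignments, the driver script main.py, and the toy map are only illustrative):
```
# xxx.py -- the class lives in its own module, shipped via --py-files
class test(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

# main.py -- import the class instead of defining it in __main__
from xxx import test
rdd = sc.parallelize(range(10)).map(lambda i: test(i, i + 1).a)
print(rdd.collect())
```
submitted as before with bin/spark-submit --py-files xxx.py main.py.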
On Mon, Feb 16, 2015 at 4:04 PM, Davies Liu dav...@databricks.com wrote:
It seems that the jar for Cassandra is not loaded; you should have
it on the classpath.
On Mon, Feb 16, 2015 at 12:08 PM, Mohamed Lrhazi
mohamed.lrh...@georgetown.edu wrote:
Hello all,
Trying the example code from
It seems that the jar for Cassandra is not loaded; you should have
it on the classpath.
On Mon, Feb 16, 2015 at 12:08 PM, Mohamed Lrhazi
mohamed.lrh...@georgetown.edu wrote:
Hello all,
Trying the example code from this package
(https://github.com/Parsely/pyspark-cassandra) , I always get
For the last question, you can trigger GC in JVM from Python by :
sc._jvm.System.gc()
On Mon, Feb 16, 2015 at 4:08 PM, Antony Mayi
antonym...@yahoo.com.invalid wrote:
thanks, that looks promising but I can't find any reference giving me more
details - can you please point me to something? Also
?
On 2015-02-12 19:27, Davies Liu wrote:
The feature works as expected in Scala/Java, but it is not implemented in
Python.
On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid iras...@cloudera.com
wrote:
I wonder if the issue is that these lines just need to add
preservesPartitioning = true
?
https
a lot,
Mohamed.
On Mon, Feb 16, 2015 at 5:46 PM, Mohamed Lrhazi
mohamed.lrh...@georgetown.edu wrote:
Oh, I don't know. thanks a lot Davies, gonna figure that out now
On Mon, Feb 16, 2015 at 5:31 PM, Davies Liu dav...@databricks.com wrote:
It also needs the Cassandra jar
large -- I've
now split it up into many smaller operations but it's still not quite there
-- see
http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html
Thanks,
Rok
On Wed, Feb 11, 2015, 19:59 Davies Liu dav...@databricks.com wrote:
Could you share
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote:
I was having trouble with memory exceptions when broadcasting a large lookup
table, so I've resorted to processing it iteratively -- but how can I modify
an RDD iteratively?
I'm trying something like :
rdd =
are you comparing?
PySpark will try to combine the multiple map() calls together, so you will get
a task which needs all the lookup_tables (the same size as before).
You could add a checkpoint after some of the iterations.
On Feb 11, 2015, at 8:11 PM, Davies Liu dav...@databricks.com wrote:
On Wed
Spark is a framework that makes it very easy to do things in parallel; it
will definitely help in your case.
def read_file(path):
    lines = open(path).readlines()  # bzip2
    return lines

filesRDD = sc.parallelize(path_to_files, N)
lines = filesRDD.flatMap(read_file)
Then you could do other transforms on
-- but the dictionary is large, it's 8 Gb pickled on disk.
On Feb 10, 2015, at 10:01 PM, Davies Liu dav...@databricks.com wrote:
Could you paste the NPE stack trace here? It would be better to create a
JIRA for it, thanks!
On Tue, Feb 10, 2015 at 10:42 AM, rok rokros...@gmail.com wrote:
I'm trying
I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case.
[1]
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
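A minimal usage sketch (the path and record length are hypothetical):
```
# each record is a fixed-length byte string of exactly recordLength bytes
records = sc.binaryRecords("hdfs:///data/fixed_width.bin", recordLength=512)
print(records.count())
```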
Davies
On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
Hi,
I want to process some
On Thu, Jan 29, 2015 at 6:36 PM, QiuxuanZhu ilsh1...@gmail.com wrote:
Dear all,
I have no idea when it raises an error when I run the following code.
def getRow(data):
    return data.msg

first_sql = "select * from logs.event where dt = '20150120' and et = 'ppc' LIMIT 10"  # error
,
Rok
On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu dav...@databricks.com wrote:
Maybe it's caused by integer overflow; is it possible that one object
or batch is bigger than 2G (after pickling)?
On Tue, Jan 27, 2015 at 7:59 AM, rok rokros...@gmail.com wrote:
I've got a dataset saved
will be fixed by https://github.com/apache/spark/pull/4226
On Tue, Jan 27, 2015 at 8:17 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
In Spark 1.2.0, it requires that the ratings be an RDD of Rating, tuple,
or list. However, the current example on the site still uses RDD[array]
as the
Maybe it's caused by integer overflow; is it possible that one object
or batch is bigger than 2G (after pickling)?
On Tue, Jan 27, 2015 at 7:59 AM, rok rokros...@gmail.com wrote:
I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
without problems. When I try to read it back
It is probably a bug; the Python worker did not exit normally. Could you
file a JIRA for this?
Also, could you show how to reproduce this behavior?
On Fri, Jan 23, 2015 at 11:45 PM, Sven Krasser kras...@gmail.com wrote:
Hey Adam,
I'm not sure I understand just yet what you have in mind. My
You need to install these libraries on all the slaves, or submit via
spark-submit:
spark-submit --py-files xxx
On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote:
Hi,
I might be asking something very trivial, but what's the recommended way of
using third-party libraries?
We have not met this issue, so we are not sure whether there are bugs
related to the reused worker or not.
Could you provide more details about it?
On Wed, Jan 21, 2015 at 2:27 AM, critikaled isasmani@gmail.com wrote:
I'm also facing the same issue.
is this a bug?
among all
the tasks within the same executor.
2015-01-21 15:04 GMT+08:00 Davies Liu dav...@databricks.com:
Maybe some change related to serializing the closure caused LogParser to
not be a singleton any more, so it is initialized for every task.
Could you change it to a Broadcast?
On Tue, Jan 20
for it, thanks!
On Wed, Jan 21, 2015 at 4:56 PM, Tassilo Klein tjkl...@gmail.com wrote:
I set spark.python.worker.reuse = false and now it seems to run longer than
before (it has not crashed yet). However, it is very very slow. How to
proceed?
On Wed, Jan 21, 2015 at 2:21 AM, Davies Liu dav
If the dataset is not huge (a few GB), you can set up NFS instead of
HDFS (which is much harder to set up):
1. export a directory in master (or anyone in the cluster)
2. mount it in the same position across all slaves
3. read/write from it by file:///path/to/mountpoint
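A rough sketch of those steps on Linux (assuming an NFS server package is installed; the host name and paths are hypothetical):
```
# 1. on the master: export a directory
echo "/data/shared *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# 2. on every slave: mount it at the same path
sudo mkdir -p /data/shared
sudo mount -t nfs master-host:/data/shared /data/shared

# 3. from PySpark, read/write through file://
# sc.textFile("file:///data/shared/input.txt")
```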
On Tue, Jan 20, 2015 at
Could you provide a short script to reproduce this issue?
On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein tjkl...@gmail.com wrote:
Hi,
I just recently tried to migrate from Spark 1.1 to Spark 1.2 - using
PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster
than Spark 1.1.
. The theano function itself is a broadcast variable.
Let me know if you need more information.
Best,
Tassilo
On Wed, Jan 21, 2015 at 1:17 AM, Davies Liu dav...@databricks.com wrote:
Could you provide a short script to reproduce this issue?
On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein tjkl
Maybe some change related to serializing the closure caused LogParser to
not be a singleton any more, so it is initialized for every task.
Could you change it to a Broadcast?
On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO raofeng...@gmail.com wrote:
Currently we are migrating from spark 1.1 to spark
Hey Phil,
Thank you for sharing this. The result didn't surprise me a lot; it's normal to do
the prototype in Python, and once it gets stable and you really need the performance,
rewriting part of it in C or all of it in another language does make sense.
It will not cost you much time.
Davies
On
I think you cannot use textFile() or binaryFile() or pickleFile()
here; they use a different format than wav.
You could get a list of paths for all the files, then
sc.parallelize() and foreach():
def process(path):
    # use subprocess to launch a process to do the job, read the stdout as result
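A minimal sketch filling that in (the external command wav-feature-tool and the paths are hypothetical; map() is used here so the results come back to the driver):
```
import subprocess

def process(path):
    # launch an external process for one file and read its stdout as the result
    out = subprocess.check_output(["wav-feature-tool", path])
    return (path, out)

paths = ["/data/audio/a.wav", "/data/audio/b.wav"]
results = sc.parallelize(paths, len(paths)).map(process).collect()
```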
What's the version of Spark you are using?
On Wed, Jan 14, 2015 at 12:00 AM, Linda Terlouw linda.terl...@icris.nl wrote:
I'm new to Spark. When I use the Movie Lens dataset 100k
(http://grouplens.org/datasets/movielens/), Spark crashes when I run the
following code. The first call to
On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote:
Hi all,
Is there a way to save dstream RDDs to a single file so that another process
can pick it up as a single RDD?
It does not need to be a single file; Spark can pick up any directory as a single RDD.
Also, it's easy to union
13 2015 at 18:15:15 Davies Liu dav...@databricks.com wrote:
On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote:
Hi all,
Is there a way to save dstream RDDs to a single file so that another
process
can pick it up as a single RDD?
It does not need to be a single file
In the current implementation of TorrentBroadcast, the blocks are
fetched one by one
in a single thread, so it cannot fully utilize the network bandwidth.
Davies
On Fri, Jan 9, 2015 at 2:11 AM, Jun Yang yangjun...@gmail.com wrote:
Guys,
I have a question regarding to Spark 1.1 broadcast
:29 PM, Davies Liu dav...@databricks.com wrote:
I still can not reproduce it with 2 nodes (4 CPUs).
Your repro.py could be faster (10 min) than before (22 min):
inpdata.map(lambda (pc, x): (x, pc=='p' and 2 or
1)).reduceByKey(lambda x, y: x|y).filter(lambda (x, pc):
pc==3).collect()
(also
I ran your scripts on a 5-node (2 CPUs, 8G mem) cluster and cannot
reproduce your failure. Should I test it with a big-memory node?
On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote:
Thanks for the input! I've managed to come up with a repro of the error with
test data only
at 12:46 AM, Davies Liu dav...@databricks.com wrote:
I ran your scripts on a 5-node (2 CPUs, 8G mem) cluster and cannot
reproduce your failure. Should I test it with a big-memory node?
On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote:
Thanks for the input! I've managed
Could you share a link about this? It's common to use Java 7; it
would be nice if we can fix this.
On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
Was your spark assembly jarred with Java 7? There's a known issue with jar
files made with that version. It
There is a known bug with the local scheduler, which will be fixed by
https://github.com/apache/spark/pull/3779
On Sun, Dec 21, 2014 at 10:57 PM, Samarth Mailinglist
mailinglistsama...@gmail.com wrote:
I’m trying to run the stateful network word count at
There is a WIP pull request [1] working on this; it should be merged
into master soon.
[1] https://github.com/apache/spark/pull/3715
On Fri, Dec 19, 2014 at 2:15 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi ,
I've just seen that Spark Streaming supports Python from version 1.2.
It's a bug, could you file a JIRA for this? thanks!
On Tue, Dec 16, 2014 at 5:49 AM, sahanbull sa...@skimlinks.com wrote:
Hi Guys,
Im running a spark cluster in AWS with Spark 1.1.0 in EC2
I am trying to convert an RDD of tuples
(u'string', int, {(int, int): int, (int, int): int})
I had created https://issues.apache.org/jira/browse/SPARK-4866, it
will be fixed by https://github.com/apache/spark/pull/3714.
Thank you for reporting this.
Davies
On Tue, Dec 16, 2014 at 12:44 PM, Davies Liu dav...@databricks.com wrote:
It's a bug, could you file a JIRA for this? thanks
Considering that any task could be launched concurrently on
different nodes, in order to make sure the generated files are
valid, you need some atomic operation (such as rename). For
example, you could generate a random name for the output file, write the
data into it, and rename it to
On Tue, Dec 9, 2014 at 11:32 AM, Mohamed Lrhazi
mohamed.lrh...@georgetown.edu wrote:
While trying simple examples of PySpark code, I systematically get these
failures when I try this.. I dont see any prior exceptions in the output...
How can I debug further to find root cause?
es_rdd =
This is fixed in 1.2. Also, in 1.2+ you could call row.asDict() to
convert the Row object into a dict.
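A tiny sketch of that call (the parquet path is hypothetical):
```
row = sqlContext.parquetFile("people.parquet").first()
d = row.asDict()  # a plain Python dict, e.g. {'field1': 5, ...}
```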
On Mon, Dec 8, 2014 at 6:38 AM, sahanbull sa...@skimlinks.com wrote:
Hi Guys,
I used applySchema to store a set of nested dictionaries and lists in a
parquet file.
Could you post your script to reproduce the results (and also how to
generate the dataset)? That will help us investigate it.
On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hmm, here I use spark on local mode on my laptop with 8 cores. The data is
on my local
Which version of Spark are you using? inferSchema() has been improved to
support empty dicts in 1.2+; could you try 1.2-RC1?
Also, you can use applySchema():
from pyspark.sql import *
fields = [StructField('field1', IntegerType(), True),
          StructField('field2', StringType(), True),
inferSchema() will work better than jsonRDD() in your case,
from pyspark.sql import Row
srdd = sqlContext.inferSchema(rdd.map(lambda x: Row(**x)))
srdd.first()
Row(field1=5, field2='string', field3={'a': 1, 'c': 2})
On Wed, Dec 3, 2014 at 12:11 AM, sahanbull sa...@skimlinks.com wrote:
Hi
On Wed, Dec 3, 2014 at 8:17 PM, chocjy jiyanyan...@gmail.com wrote:
Hi,
I am using spark with version number 1.1.0 on an EC2 cluster. After I
submitted the job, it returned an error saying that a python module cannot
be loaded due to missing files. I am using the same command that used to
applySchema() only accepts an RDD of Row/list/tuple; it does not work with
numpy.array.
After applySchema(), the Python RDD will be pickled and unpickled in the
JVM, so you will not get any benefit from using numpy.array.
It will work if you convert the ndarray into a list:
schemaRDD =
These libraries can be used in PySpark easily. For example, MLlib
uses NumPy heavily; it can accept np.array or SciPy sparse matrices
as vectors.
On Mon, Nov 24, 2014 at 10:56 AM, Rohit Pujari rpuj...@hortonworks.com wrote:
Hello Folks:
Since spark exposes python bindings and allows you to
It seems that `localhost` cannot be resolved on your machines; I have
filed https://issues.apache.org/jira/browse/SPARK-4475 to track it.
On Tue, Nov 18, 2014 at 6:10 AM, amin mohebbi
aminn_...@yahoo.com.invalid wrote:
Hi there,
I have already downloaded Pre-built spark-1.1.0, I want to run
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote:
Use zipWithIndex but cache the data before you run zipWithIndex...that way
your ordering will be consistent (unless the bug has been fixed where you
don't have to cache the data)...
Could you point some link about
I see, thanks!
On Tue, Nov 18, 2014 at 12:12 PM, Sean Owen so...@cloudera.com wrote:
On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote:
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com
wrote:
Use zipWithIndex but cache the data before you run
rdd1.union(rdd2).groupByKey()
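A minimal runnable sketch of this on the toy data from the question below (the second value list of rdd2 is assumed, since the original snippet is truncated):
```
rdd1 = sc.parallelize([("key1", ["value1", "value2"]),
                       ("key2", ["value3", "value4"])])
rdd2 = sc.parallelize([("key1", ["value5", "value6"])])

merged = rdd1.union(rdd2).groupByKey().mapValues(list)
# e.g. [('key1', [['value1', 'value2'], ['value5', 'value6']]),
#       ('key2', [['value3', 'value4']])]
```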
On Thu, Nov 13, 2014 at 3:41 AM, Blind Faith person.of.b...@gmail.com wrote:
Let us say I have the following two RDDs, with the following key-value
pairs.
rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]
and
rdd2 = [ (key1, [value5,
worker nodes
with a total of about 80 cores.
Thanks again for the tips!
On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List]
[hidden email] wrote:
Could you tell us how large the data set is? It will help us debug this
issue.
On Thu, Nov 6, 2014 at 10:39 AM, skane [hidden
One option may be to call HDFS tools or a client to rename them after saveAsXXXFile().
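A rough sketch of that option, assuming the hdfs CLI is available on the driver and the paths are hypothetical:
```
import subprocess

rdd.saveAsPickleFile("hdfs:///tmp/job_output")
subprocess.check_call(["hdfs", "dfs", "-mv", "/tmp/job_output", "/data/results/run_42"])
```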
On Thu, Nov 13, 2014 at 9:39 PM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi ,
I am running pyspark job.
I need to serialize the final result to HDFS as binary files, and to have the
ability to give a name to the output
You could use the following as compressionCodecClass:
DEFLATE org.apache.hadoop.io.compress.DefaultCodec
gzip org.apache.hadoop.io.compress.GzipCodec
bzip2 org.apache.hadoop.io.compress.BZip2Codec
LZO com.hadoop.compression.lzo.LzopCodec
for gzip,
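For example, a minimal sketch of saving with the gzip codec (the output path is hypothetical):
```
rdd.saveAsTextFile("hdfs:///data/out_gz",
                   compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
```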
This is a bug; it will be fixed by https://github.com/apache/spark/pull/3230
On Wed, Nov 12, 2014 at 7:20 AM, rprabhu rpra...@ufl.edu wrote:
Hello,
I'm trying to run a classification task using mllib decision trees. After
successfully training the model, I was trying to test the model using some