I will be streaming data and am trying to understand how to get rid of old
data from a stream so it does not become too large. I will stream in one
large table of buying data and join that to another table of different
data. I need the last 14 days from the second table. I will not need data
that is
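To make the retention part concrete, here is a minimal sketch of what I have in mind using a Structured Streaming watermark; the path, schema, and column names are invented, and I realize that stream-stream joins which drop old state need a reasonably recent Spark:
```
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("retention-sketch").getOrCreate()

schema = StructType([
    StructField("item_id", StringType()),
    StructField("price", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical streaming source of buying data; path and schema are made up.
buys = spark.readStream.schema(schema).json("s3://my-bucket/buying-data/")

# The watermark lets Spark drop state older than 14 days, so the state
# backing a join or aggregation does not grow without bound.
buys_recent = buys.withWatermark("event_time", "14 days")
```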
I would like to see the full error. However, S3 can give misleading
messages if you don't have the correct permissions.
On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni wrote:
> Hi all
> I am using the following code for persisting data into S3 (aws keys are
> already stored in the environment vari
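Since the code itself was cut off above, here is a generic sketch of the pattern I would expect, with the s3a connector picking the keys up from the environment; the bucket name and dataframe are invented, and this is not Marco's actual code:
```
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-write-sketch").getOrCreate()

# The s3a connector can also find AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
# in the environment on its own; setting them explicitly just keeps the
# sketch self-contained.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
hconf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# A missing or wrong bucket permission often shows up here as a vague
# 403 / access-denied error rather than a clear message.
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")  # invented bucket
```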
We are tasked with loading a big file (possibly 2TB) into a data warehouse.
In order to do this efficiently, we need to split the file into smaller
files.
I don't believe there is a way to do this with Spark, because in order for
Spark to distribute the file to the worker nodes, it first has to be
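If it turns out the file is in a splittable format (for example uncompressed text or bzip2 rather than gzip), the naive Spark version of the split would look roughly like this sketch; the paths and partition count are invented:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-sketch").getOrCreate()

# For splittable input, Spark assigns one task per input split instead of
# pulling the whole file onto a single node.
lines = spark.read.text("s3a://my-bucket/huge-input-file")  # invented path

# Writing out with more partitions produces that many smaller files.
lines.repartition(2000).write.text("s3a://my-bucket/split-output/")
```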
As part of my processing, I have the following code:
rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10)
rdd.count()
The s3 directory has about 8GB of data and 61,878 files. I am using Spark
2.1, and running it on EMR with 15 m3.xlarge nodes.
The job fails with this error:
https://issues.apache.org/jira/browse/SPARK-13330
>
>
>
> Holden Karau wrote on Wed, Apr 5, 2017 at 12:03 AM:
>
>> Which version of Spark is this (or is it a dev build)? We've recently
>> made some improvements with PYTHONHASHSEED propagation.
>>
>> On Tue, Apr 4, 2017 at 7:49 AM Eike von Seg
So that means I have to pass that bash variable to the EMR clusters when I
spin them up, not afterwards. I'll give that a go.
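For reference, my plan is roughly the sketch below: set the variable through Spark's spark.executorEnv.* mechanism so it reaches the Python workers (I am not certain this alone is enough on YARN, which is what SPARK-13330 is about); on EMR the same export can apparently also be baked in at launch through a configuration classification, but I have not verified the exact JSON.
```
from pyspark import SparkConf, SparkContext

# Sketch: propagate PYTHONHASHSEED to the executors so every Python worker
# hashes strings consistently. The seed value 0 is just an example.
conf = (SparkConf()
        .setAppName("hashseed-sketch")
        .set("spark.executorEnv.PYTHONHASHSEED", "0"))
sc = SparkContext(conf=conf)

# The driver process still needs PYTHONHASHSEED exported in its own
# environment before ipython/spark-submit is launched.
```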
Thanks!
Henry
On Tue, Apr 4, 2017 at 7:49 AM, Eike von Seggern wrote:
> 2017-04-01 21:54 GMT+02:00 Paul Tremblay:
>
> >> When I try to do a groupBy
What do you want to do with the results of the query?
Henry
On Wed, Mar 29, 2017 at 12:00 PM, szep.laszlo.it wrote:
> Hi,
>
> after I created a dataset
>
> Dataset df = sqlContext.sql("query");
>
> I need to have a result values and I call a method: collectAsList()
>
> List list = df.collectAsL
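(The reason I ask: if the result set is large, collectAsList() pulls all of it into driver memory. In rough PySpark terms the trade-off looks like the sketch below; the table name and output path are invented.)
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-sketch").getOrCreate()

# Small result: collecting it to the driver is fine.
rows = spark.sql("SELECT * FROM some_table LIMIT 1000").collect()  # invented table

# Large result: keep it distributed and write it out instead of collecting.
spark.sql("SELECT * FROM some_table") \
     .write.mode("overwrite").parquet("s3a://my-bucket/query-output/")  # invented path
```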
So if I am understanding your problem, you have the data in CSV files, but
the CSV files are gzipped? If so, Spark can read a gzipped file directly; see the sketch below.
Sorry if I didn't understand your question.
Henry
On Mon, Apr 3, 2017 at 5:05 AM, Old-School wrote:
> I have a dataset that contains DocID, WordID
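Something like this sketch should work directly on .csv.gz files, since Spark decompresses gzip transparently for text-based sources; the path and options are invented:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-csv-sketch").getOrCreate()

# Spark reads .gz text files transparently; note that gzip is not splittable,
# so each .csv.gz file becomes a single partition.
df = spark.read.csv("s3a://my-bucket/docs/*.csv.gz",  # invented path
                    header=False, inferSchema=True)
df.show(5)
```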
d run the history server like:
> ```
> cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
> sbin/start-history-server.sh
> ```
> and then open http://localhost:18080
>
>
>
>
> On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay wrote:
>
>> I am looking for tips on
When I try to do a groupByKey() in my spark environment, I get the error
described here:
http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh
In order to attempt to fix the problem, I set up my ipython environment
with the
I am looking for tips on evaluating my Spark job after it has run.
I know that right now I can look at the history of jobs through the web ui.
I also know how to look at the current resources being used by a similar
web ui.
However, I would like to look at the logs after the job is finished to
ev
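If it matters, my understanding is that the history server only has something to show for a finished job if event logging was enabled while the job ran, roughly like the sketch below (the log directory is invented):
```
from pyspark.sql import SparkSession

# Event logging must be on while the job runs for the history server to be
# able to show it afterwards; the history server reads the same directory.
spark = (SparkSession.builder
         .appName("evaluate-after-run-sketch")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///var/log/spark-events")  # invented dir
         .getOrCreate())
```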
work as well:
http://michaelryanbell.com/processing-whole-files-spark-s3.html
Jon
On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay wrote:
I've actually been able to trace the problem to the files being
read in. If I change to a different d
chine
On Feb 4, 2017 16:25, "Paul Tremblay" wrote:
I am using pyspark 2.1 and am wondering how to convert a flat
file, with one record per row, into a columnar format.
Here is an example of the data:
u'WARC/1.0
I've actually been able to trace the problem to the files being read in.
If I change to a different directory, then I don't get the error. Is one
of the executors running out of memory?
On 02/06/2017 02:35 PM, Paul Tremblay wrote:
When I try to create an rdd using wholeTextFiles
When I try to create an rdd using wholeTextFiles, I get an
incomprehensible error. But when I use the same path with sc.textFile, I
get no error.
I am using pyspark with spark 2.1.
in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/
rdd = sc.wholeTextFiles(
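For reference, my understanding of the difference between the two calls is roughly the sketch below (with an invented path standing in for the truncated one above):
```
from pyspark import SparkContext

sc = SparkContext(appName="wholetextfiles-sketch")

# textFile yields one record per line and can split a large file across tasks.
lines = sc.textFile("s3a://my-bucket/warc-segment/")  # invented path

# wholeTextFiles yields (filename, full_contents) pairs, so each whole file
# has to fit in memory in a single task, which may be where the error
# (and the possible executor out-of-memory) comes from.
files = sc.wholeTextFiles("s3a://my-bucket/warc-segment/")  # invented path
```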
I am using pyspark 2.1 and am wondering how to convert a flat file, with
one record per row, into a columnar format.
Here is an example of the data:
u'WARC/1.0',
u'WARC-Type: warcinfo',
u'WARC-Date: 2016-12-08T13:00:23Z',
u'WARC-Record-ID: ',
u'Content-Length: 344',
u'Content-Type: applicati
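To make the question concrete, the direction I have been thinking of is to parse each 'Key: value' header line and fold one record's pairs into a row, roughly like the sketch below; the field names come from the sample above, everything else is invented:
```
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("warc-columnar-sketch").getOrCreate()

# One record's header lines, as in the sample above (values shortened).
sample = [
    "WARC/1.0",
    "WARC-Type: warcinfo",
    "WARC-Date: 2016-12-08T13:00:23Z",
    "Content-Length: 344",
]

def to_row(lines):
    # Keep only "Key: value" lines and fold them into one dict per record.
    pairs = dict(l.split(": ", 1) for l in lines if ": " in l)
    return Row(warc_type=pairs.get("WARC-Type"),
               warc_date=pairs.get("WARC-Date"),
               content_length=pairs.get("Content-Length"))

df = spark.createDataFrame([to_row(sample)])
df.show(truncate=False)
```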