I will be streaming data and am trying to understand how to get rid of old
data from a stream so it does not become too large. I will stream in one
large table of buying data and join that to another table of different
data. I need the last 14 days from the second table. I will not need data
that is
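To make the retention part concrete, here is a minimal sketch of what I have in mind using a Structured Streaming watermark; the path, schema, and column names are invented, and I realize that stream-stream joins which drop old state need a reasonably recent Spark:
```
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("retention-sketch").getOrCreate()

schema = StructType([
    StructField("item_id", StringType()),
    StructField("price", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical streaming source of buying data; path and schema are made up.
buys = spark.readStream.schema(schema).json("s3://my-bucket/buying-data/")

# The watermark lets Spark drop state older than 14 days, so the state
# backing a join or aggregation does not grow without bound.
buys_recent = buys.withWatermark("event_time", "14 days")
```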
I would like to see the full error. However, S3 can give misleading
messages if you don't have the correct permissions.
On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni wrote:
> Hi all
> I am using the following code for persisting data into S3 (aws keys are
> already stored in the environment vari
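Since the code itself was cut off above, here is a generic sketch of the pattern I would expect, with the s3a connector picking the keys up from the environment; the bucket name and dataframe are invented, and this is not Marco's actual code:
```
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-write-sketch").getOrCreate()

# The s3a connector can also find AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
# in the environment on its own; setting them explicitly just keeps the
# sketch self-contained.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
hconf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# A missing or wrong bucket permission often shows up here as a vague
# 403 / access-denied error rather than a clear message.
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")  # invented bucket
```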
We are tasked with loading a big file (possibly 2TB) into a data warehouse.
In order to do this efficiently, we need to split the file into smaller
files.
I don't believe there is a way to do this with Spark, because in order for
Spark to distribute the file to the worker nodes, it first has to be
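If it turns out the file is in a splittable format (for example uncompressed text or bzip2 rather than gzip), the naive Spark version of the split would look roughly like this sketch; the paths and partition count are invented:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-sketch").getOrCreate()

# For splittable input, Spark assigns one task per input split instead of
# pulling the whole file onto a single node.
lines = spark.read.text("s3a://my-bucket/huge-input-file")  # invented path

# Writing out with more partitions produces that many smaller files.
lines.repartition(2000).write.text("s3a://my-bucket/split-output/")
```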
As part of my processing, I have the following code:
rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10)
rdd.count()
The s3 directory has about 8GB of data and 61,878 files. I am using Spark
2.1, and running it on EMR with 15 m3.xlarge nodes.
The job fails with this error:
https://issues.apache.org/jira/browse/SPARK-13330
>
>
>
> Holden Karau wrote on Wed, Apr 5, 2017 at 12:03 AM:
>
>> Which version of Spark is this (or is it a dev build)? We've recently
>> made some improvements with PYTHONHASHSEED propagation.
>>
>> On Tue, Apr 4, 2017 at 7:49 AM Eike von Seg
So that means I have to pass that bash variable to the EMR clusters when I
spin them up, not afterwards. I'll give that a go.
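For reference, my plan is roughly the sketch below: set the variable through Spark's spark.executorEnv.* mechanism so it reaches the Python workers (I am not certain this alone is enough on YARN, which is what SPARK-13330 is about); on EMR the same export can apparently also be baked in at launch through a configuration classification, but I have not verified the exact JSON.
```
from pyspark import SparkConf, SparkContext

# Sketch: propagate PYTHONHASHSEED to the executors so every Python worker
# hashes strings consistently. The seed value 0 is just an example.
conf = (SparkConf()
        .setAppName("hashseed-sketch")
        .set("spark.executorEnv.PYTHONHASHSEED", "0"))
sc = SparkContext(conf=conf)

# The driver process still needs PYTHONHASHSEED exported in its own
# environment before ipython/spark-submit is launched.
```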
Thanks!
Henry
On Tue, Apr 4, 2017 at 7:49 AM, Eike von Seggern wrote:
> 2017-04-01 21:54 GMT+02:00 Paul Tremblay:
>
> >> When I try to do a groupBy
What do you want to do with the results of the query?
Henry
On Wed, Mar 29, 2017 at 12:00 PM, szep.laszlo.it wrote:
> Hi,
>
> after I created a dataset
>
> Dataset df = sqlContext.sql("query");
>
> I need to have a result values and I call a method: collectAsList()
>
> List list = df.collectAsL
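(The reason I ask: if the result set is large, collectAsList() pulls all of it into driver memory. In rough PySpark terms the trade-off looks like the sketch below; the table name and output path are invented.)
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-sketch").getOrCreate()

# Small result: collecting it to the driver is fine.
rows = spark.sql("SELECT * FROM some_table LIMIT 1000").collect()  # invented table

# Large result: keep it distributed and write it out instead of collecting.
spark.sql("SELECT * FROM some_table") \
     .write.mode("overwrite").parquet("s3a://my-bucket/query-output/")  # invented path
```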
So if I am understanding your problem, you have the data in CSV files, but
the CSV files are gzipped? If so, Spark can read a gzipped file directly; see the sketch below.
Sorry if I didn't understand your question.
Henry
On Mon, Apr 3, 2017 at 5:05 AM, Old-School wrote:
> I have a dataset that contains DocID, WordID
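Something like this sketch should work directly on .csv.gz files, since Spark decompresses gzip transparently for text-based sources; the path and options are invented:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-csv-sketch").getOrCreate()

# Spark reads .gz text files transparently; note that gzip is not splittable,
# so each .csv.gz file becomes a single partition.
df = spark.read.csv("s3a://my-bucket/docs/*.csv.gz",  # invented path
                    header=False, inferSchema=True)
df.show(5)
```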
d run the history server like:
> ```
> cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
> sbin/start-history-server.sh
> ```
> and then open http://localhost:18080
>
>
>
>
> On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay wrote:
>
>> I am looking for tips on
When I try to do a groupByKey() in my spark environment, I get the error
described here:
http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh
In order to attempt to fix the problem, I set up my ipython environment
with the
I am looking for tips on evaluating my Spark job after it has run.
I know that right now I can look at the history of jobs through the web ui.
I also know how to look at the current resources being used by a similar
web ui.
However, I would like to look at the logs after the job is finished to
ev
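If it matters, my understanding is that the history server only has something to show for a finished job if event logging was enabled while the job ran, roughly like the sketch below (the log directory is invented):
```
from pyspark.sql import SparkSession

# Event logging must be on while the job runs for the history server to be
# able to show it afterwards; the history server reads the same directory.
spark = (SparkSession.builder
         .appName("evaluate-after-run-sketch")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///var/log/spark-events")  # invented dir
         .getOrCreate())
```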
work as well:
http://michaelryanbell.com/processing-whole-files-spark-s3.html
Jon
On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay wrote:
I've actually been able to trace the problem to the files being
read in. If I change to a different d
chine
On Feb 4, 2017 16:25, "Paul Tremblay" wrote:
I am using pyspark 2.1 and am wondering how to convert a flat
file, with one record per row, into a columnar format.
Here is an example of the data:
u'WARC/1.0
I've actually been able to trace the problem to the files being read in.
If I change to a different directory, then I don't get the error. Is one
of the executors running out of memory?
On 02/06/2017 02:35 PM, Paul Tremblay wrote:
When I try to create an rdd using wholeTextFiles
When I try to create an rdd using wholeTextFiles, I get an
incomprehensible error. But when I use the same path with sc.textFile, I
get no error.
I am using pyspark with spark 2.1.
in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/
rdd = sc.wholeTextFiles(
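For reference, my understanding of the difference between the two calls is roughly the sketch below (with an invented path standing in for the truncated one above):
```
from pyspark import SparkContext

sc = SparkContext(appName="wholetextfiles-sketch")

# textFile yields one record per line and can split a large file across tasks.
lines = sc.textFile("s3a://my-bucket/warc-segment/")  # invented path

# wholeTextFiles yields (filename, full_contents) pairs, so each whole file
# has to fit in memory in a single task, which may be where the error
# (and the possible executor out-of-memory) comes from.
files = sc.wholeTextFiles("s3a://my-bucket/warc-segment/")  # invented path
```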
I am using pyspark 2.1 and am wondering how to convert a flat file, with
one record per row, into a columnar format.
Here is an example of the data:
u'WARC/1.0',
u'WARC-Type: warcinfo',
u'WARC-Date: 2016-12-08T13:00:23Z',
u'WARC-Record-ID: ',
u'Content-Length: 344',
u'Content-Type: applicati
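To make the question concrete, the direction I have been thinking of is to parse each 'Key: value' header line and fold one record's pairs into a row, roughly like the sketch below; the field names come from the sample above, everything else is invented:
```
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("warc-columnar-sketch").getOrCreate()

# One record's header lines, as in the sample above (values shortened).
sample = [
    "WARC/1.0",
    "WARC-Type: warcinfo",
    "WARC-Date: 2016-12-08T13:00:23Z",
    "Content-Length: 344",
]

def to_row(lines):
    # Keep only "Key: value" lines and fold them into one dict per record.
    pairs = dict(l.split(": ", 1) for l in lines if ": " in l)
    return Row(warc_type=pairs.get("WARC-Type"),
               warc_date=pairs.get("WARC-Date"),
               content_length=pairs.get("Content-Length"))

df = spark.createDataFrame([to_row(sample)])
df.show(truncate=False)
```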