Where can I get a few GBs of sample data?

2017-09-28 Thread Gaurav1809
Hi All, I have set up a multi-node Spark cluster and am now looking for a good volume of data to test it with and see how it performs. Can anyone provide pointers as to where I can get a few GBs of free sample data? Thanks and regards, Gaurav -- Sent from:

Re: How to run MLlib's word2vec in CBOW mode?

2017-09-28 Thread Nick Pentreath
MLlib currently doesn't support CBOW - there is an open PR for it (see https://issues.apache.org/jira/browse/SPARK-20372). On Thu, 28 Sep 2017 at 09:56 pun wrote: > Hello, > My understanding is that word2vec can be run in two modes: > >- continuous bag-of-words

Re: Where can I get a few GBs of sample data?

2017-09-28 Thread Gourav Sengupta
Just out of curiosity, why can't you just generate the data randomly? I have used that mechanism and it helps a lot, especially if you are just starting to use Spark. Regards, Gourav Sengupta On Thu, Sep 28, 2017 at 5:04 PM, Gaurav1809 wrote: > Hi All, > > I have set up a multi

[SPARK-SQL] Spark Persist slower than non-persist call.

2017-09-28 Thread sfbayeng
My settings are: running Spark 2.1 on a 3-node YARN cluster with 160 GB. Dynamic allocation turned on. spark.executor.memory=6G, spark.executor.cores=6. First, I am reading the Hive tables orders (329 MB) and lineitems (1.43 GB) and doing a left outer join. Next, I apply 7 different
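For context, a minimal sketch of that setup (the join key "order_id" is an assumption, and `spark` is the SparkSession); persist only pays off when the persisted result is reused by several later actions, and the first action still bears the cost of computing and caching it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hive-backed session, matching the post's setup.
val spark = SparkSession.builder().appName("persist-compare").enableHiveSupport().getOrCreate()

// Left outer join of the two Hive tables named in the post; "order_id" is an assumed join key.
val joined = spark.table("orders")
  .join(spark.table("lineitems"), Seq("order_id"), "left_outer")

// persist() helps only if `joined` is reused downstream; the first action
// still has to compute it *and* write it to the cache.
val cached = joined.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()   // force materialization once, before timing the later steps
```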

Re: Where can I get a few GBs of sample data?

2017-09-28 Thread Jörn Franke
I think just any dataset is not useful. The data should be close to the real data that you want to process. Similarly, the processing should be the same as what you plan to run. > On 28. Sep 2017, at 18:04, Gaurav1809 wrote: > > Hi All, > > I have set up a multi node spark cluster

Re: More instances = slower Spark job

2017-09-28 Thread Jörn Franke
It looks a little bit strange to me. First, json.gz files are single-threaded, i.e. each file can only be processed by one thread (so it is good to have many files of around 128 MB to 512 MB each). Then, what you do in the code is already done by the data source. There is no need to read the

Re: Where can I get a few GBs of sample data?

2017-09-28 Thread Sonal Goyal
Here are some links for public data sets https://aws.amazon.com/public-datasets/ https://www.springboard.com/blog/free-public-data-sets-data-science-project/ Thanks, Sonal Nube Technologies On Thu, Sep 28, 2017 at 9:34 PM,

RE: Where can I get a few GBs of sample data?

2017-09-28 Thread Prem Moola
As mentioned earlier, just testing some random data for the sake of testing isn't useful and wouldn't really yield any meaningful information. That being said, here are some free resources for getting data: www.quandl.com www.data.gov Thanks Prem Moola (201.679.9071) From: Jörn Franke

LDA and evaluating topic number

2017-09-28 Thread Cody Buntain
Hi, all! Is there an example somewhere of using LDA’s logPerplexity()/logLikelihood() functions to evaluate topic counts? The existing MLlib LDA examples show calling them, but I can’t find any documentation about how to interpret the outputs. Graphing the outputs for logs of
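For reference, a hedged sketch of the kind of loop being asked about, using the spark.ml API (`docs` is an assumed DataFrame with a "features" column of term-count vectors). Lower perplexity / higher log-likelihood on held-out data is generally better, but the curves often keep improving with k, so they are usually read for an elbow rather than a hard optimum:

```scala
import org.apache.spark.ml.clustering.LDA

// Hold out some documents so the metrics are not computed on the training set.
val Array(train, heldOut) = docs.randomSplit(Array(0.8, 0.2), seed = 42L)

for (k <- Seq(5, 10, 20, 40, 80)) {
  val model = new LDA().setK(k).setMaxIter(50).fit(train)
  // Compare topic counts on the held-out documents.
  println(s"k=$k  logPerplexity=${model.logPerplexity(heldOut)}  logLikelihood=${model.logLikelihood(heldOut)}")
}
```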

Re: How to run MLlib's word2vec in CBOW mode?

2017-09-28 Thread pun
Thank you so much! Any sense of how long this may take to get released? TIA -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: More instances = slower Spark job

2017-09-28 Thread Jeroen Miller
More details on what I want to achieve. Maybe someone can suggest a course of action. My processing is extremely simple: reading .json.gz text files, filtering each line according to a regex, and saving the surviving lines in a similarly named .gz file. Unfortunately, changing the data format is
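A minimal sketch of that pipeline, assuming `sc` is the SparkContext and that the paths and regex are placeholders; since gzip is not splittable, parallelism comes from the number of files rather than from splitting any single file:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// Each .json.gz file is read by a single task (gzip is not splittable),
// so many medium-sized files parallelize better than a few huge ones.
val lines = sc.textFile("s3://my-bucket/input/*.json.gz")                   // placeholder path
val kept  = lines.filter(_.matches(""".*"eventType"\s*:\s*"click".*"""))    // placeholder regex
kept.saveAsTextFile("s3://my-bucket/output/filtered", classOf[GzipCodec])   // gzip-compressed output
```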

Re: More instances = slower Spark job

2017-09-28 Thread Vadim Semenov
Instead of having one job, you can try processing each file in a separate job, but run multiple jobs in parallel within one SparkContext. Something like this should work for you: it'll submit N jobs from the driver, the jobs will run independently, but the executors will dynamically work on different
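The actual code was trimmed from the archive; a hedged sketch of the general idea, using Scala Futures to submit independent jobs concurrently from the driver (the file list and per-file processing are placeholders, and `sc` is the shared SparkContext):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// One Spark job per file, all submitted concurrently from the driver on the same
// SparkContext; executors share their cores across whichever jobs have pending tasks.
val files = Seq("s3://my-bucket/a.json.gz", "s3://my-bucket/b.json.gz", "s3://my-bucket/c.json.gz")

val jobs = files.map { path =>
  Future {
    sc.textFile(path)
      .filter(_.contains("keep-me"))                         // placeholder per-file processing
      .saveAsTextFile(path.stripSuffix(".json.gz") + ".filtered")
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)
```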

Re: Massive fetch fails, io errors in TransportRequestHandler

2017-09-28 Thread Vadim Semenov
Looks like there's slowness in sending shuffle files; maybe one executor gets overwhelmed with all the other executors trying to pull data? Try lifting `spark.network.timeout` further; we ourselves had to push it to 600s from the default 120s. On Thu, Sep 28, 2017 at 10:19 AM, Ilya Karpov
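For reference, one way to set that (the value below is simply the 600s mentioned above; the same setting can also be passed with --conf on spark-submit):

```scala
import org.apache.spark.SparkConf

// Raise the shuffle/RPC timeout from the 120s default to the 600s mentioned above.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.network.timeout", "600s")
```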

Upgraded to spark 2.2 and get Guava error

2017-09-28 Thread mckunkel
Greetings, I am trying to upgrade from 2.1.1 to 2.2. When I run some of the basic examples given on the webpage, I get an error: Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat

Re: Upgraded to spark 2.2 and get Guava error

2017-09-28 Thread Michael C. Kunkel
Greetings, I also noticed that this error does not appear if I spark-submit. But I would rather know how to solve this error so I can do simple testing in Eclipse. Thanks BR MK Michael C. Kunkel, USMC, PhD Forschungszentrum Jülich Nuclear Physics Institute

Re: Upgraded to spark 2.2 and get Guava error

2017-09-28 Thread Michael C. Kunkel
Greetings, Sorry for the trouble, I had an old Guava dependency in the POM file. Not sure how I missed it. BR MK Michael C. Kunkel, USMC, PhD Forschungszentrum Jülich Nuclear Physics Institute and Juelich Center for Hadron Physics Experimental Hadron

Re: More instances = slower Spark job

2017-09-28 Thread Jeroen Miller
On Thu, Sep 28, 2017 at 9:02 PM, Jörn Franke wrote: > It looks to me a little bit strange. First json.gz files are single threaded, > ie each file can only be processed by one thread (so it is good to have many > files of around 128 MB to 512 MB size each). Indeed.

Re: More instances = slower Spark job

2017-09-28 Thread Gourav Sengupta
I think that Vadim's response makes a lot of sense in terms of utilizing Spark. Why are you not using the JSON reader of Spark? Your input has to follow a particular JSON style, but it would be interesting to know whether you have looked into it at all. If you are going to read them only once
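A hedged sketch of what using Spark's JSON reader might look like here (`spark` is the SparkSession; the path, column name and output target are placeholders). spark.read.json expects one JSON object per line and decompresses .gz transparently, though each gzipped file is still read by a single task:

```scala
import org.apache.spark.sql.functions.col

// Line-delimited JSON; gzip decompression is handled automatically.
val df   = spark.read.json("s3://my-bucket/input/*.json.gz")    // placeholder path
val kept = df.filter(col("eventType") === "click")              // placeholder column/predicate
kept.write.mode("overwrite").parquet("s3://my-bucket/output/clicks")
```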

Re: Loading objects only once

2017-09-28 Thread Vadim Semenov
As an alternative: ``` spark-submit --files ``` The files will be put on each executor in the working directory, so you can then load them alongside your `map` function. Behind the scenes it uses the `SparkContext.addFile` method, which you can use too

Re: How to read LZO file in Spark?

2017-09-28 Thread Vida Ha
https://docs.databricks.com/spark/latest/data-sources/read-lzo.html On Wed, Sep 27, 2017 at 6:36 AM 孫澤恩 wrote: > Hi All, > > Currently, I follow this blog > http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ > that > I could use hdfs

Re: Loading objects only once

2017-09-28 Thread Vadim Semenov
Something like this:
```scala
object Model {
  @transient lazy val modelObject = new ModelLoader("model-filename")
  def get() = modelObject
}

object SparkJob {
  def main(args: Array[String]) = {
    sc.addFile("s3://bucket/path/model-filename")
    sc.parallelize(…).map(test => {
```
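The archive truncates the snippet there. A hedged guess at how it might continue (the map body and the `score` method are assumptions, not the original code): because `modelObject` is a lazy val on a singleton object, each executor JVM constructs the model at most once, and `sc.addFile` has already placed the file in every executor's working directory.

```scala
// Continuation sketch only: apply the lazily initialized model inside the map.
sc.parallelize(1 to 100).map { record =>
  val model = Model.get()   // first call on an executor loads "model-filename" once
  model.score(record)       // `score` is a hypothetical ModelLoader method
}.collect()
```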

Replicating a row n times

2017-09-28 Thread Kanagha Kumar
Hi, I'm trying to replicate a single row from a dataset n times and create a new dataset from it. But while replicating, I need a column's value to be changed for each replication, since it would end up as the primary key when stored finally. Looked at the following reference:
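One hedged way to do this (the dataset `ds`, the key column "id", the key scheme, and `spark` as the SparkSession are all assumptions): cross join the single row with a range of n replica ids and derive a new key from the replica id:

```scala
import org.apache.spark.sql.functions._

val n = 10
// Take the row to replicate, cross join it with n replica ids,
// and derive a distinct primary key per copy.
val single = ds.limit(1)
val replicated = single
  .crossJoin(spark.range(n).toDF("replica_id"))
  .withColumn("id", concat(col("id").cast("string"), lit("_"), col("replica_id").cast("string")))
  .drop("replica_id")
```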

Persist DStream into a single file on HDFS

2017-09-28 Thread Mustafa Elbehery
Hi Folks, I am writing a pipeline which reads from Kafka, applies some transformations, then persists to HDFS. Obviously such an operation is not supported on DStream, since the *DStream.save*(Path) method considers the Path as a directory, not a file. Also using

Customize Partitioner for Datasets

2017-09-28 Thread Kuchekar
Hi, Is there a way we can customize the partitioner for a Dataset to be a Hive Hash Partitioner rather than the Murmur3 Partitioner? Regards, Kuchekar, Nilesh

Re: Replicating a row n times

2017-09-28 Thread ayan guha
How about using a row number for the primary key? Select row_number() over (), * from table On Fri, 29 Sep 2017 at 10:21 am, Kanagha Kumar wrote: > Hi, > > I'm trying to replicate a single row from a dataset n times and create a > new dataset from it. But, while replicating I
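A sketch along those lines (as far as I recall, Spark's analyzer rejects a bare over () for ranking functions, so the window below orders by an assumed "created_at" column; `df` is a placeholder DataFrame):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Ranking windows must be ordered; an unpartitioned window also pulls
// all rows through a single partition, which is fine for small data.
val numbered = df.withColumn("pk", row_number().over(Window.orderBy(col("created_at"))))

// Alternative when consecutive numbers are not required: unique 64-bit ids.
val withIds = df.withColumn("pk", monotonically_increasing_id())
```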

Re: Applying a Java script to many files: Java API or also Python API?

2017-09-28 Thread Giuseppe Celano
Hi, What I meant is that I could run the Java script using the subprocess module in Python. In that case, is any difference in performance expected compared to coding directly against the Java API? Thanks. > On Sep 28, 2017, at 3:32 AM, Weichen Xu wrote: > > I think you

Re: More instances = slower Spark job

2017-09-28 Thread Sonal Goyal
Also check whether the compression algorithm you use is splittable. Thanks, Sonal Nube Technologies On Thu, Sep 28, 2017 at 2:17 PM, Tejeshwar J1 < tejeshwar...@globallogic.com.invalid> wrote: > Hi Miller, > > > > Try using > >

RE: More instances = slower Spark job

2017-09-28 Thread JG Perrin
As the others have mentioned, your loading time might kill your benchmark… I am in a similar process right now, but I time each operation: load, process 1, process 2, etc. It's not always easy with lazy operators, but you can try to force operations with a dummy collect and cache (for benchmarking
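A hedged sketch of that kind of per-step timing (the data source, the transformations, and the choice of count to force evaluation are all placeholders; `spark` is the SparkSession):

```scala
import org.apache.spark.sql.functions.col

// Crude per-step timing: because of lazy evaluation, each step has to be
// forced (here with count) before the elapsed time reflects that step's work.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

val loaded = timed("load") {
  val df = spark.read.parquet("s3://my-bucket/input")   // placeholder source
  df.cache(); df.count(); df
}
val step1 = timed("process 1") {
  val out = loaded.filter(col("value") > 0)             // placeholder transformation
  out.cache(); out.count(); out
}
```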

Re: More instances = slower Spark job

2017-09-28 Thread Gourav Sengupta
Hi, no matter what you do and how many nodes you start, if you have a single text file it will not use parallelism. Therefore the options are converting the text file to Parquet or other formats, or just splitting the text file itself into several individual files. Please do let

Re: More instances = slower Spark job

2017-09-28 Thread Steve Loughran
On 28 Sep 2017, at 09:41, Jeroen Miller wrote: Hello, I am experiencing a disappointing performance issue with my Spark jobs as I scale up the number of instances. The task is trivial: I am loading large (compressed) text files from S3,

RE: Loading objects only once

2017-09-28 Thread JG Perrin
Maybe put the model on each executor’s disk and load it from there? Depending on how you use the data/model, using something like Livy and sharing the same connection may help. From: Naveen Swamy [mailto:mnnav...@gmail.com] Sent: Wednesday, September 27, 2017 9:08 PM To: user@spark.apache.org

Re: More instances = slower Spark job

2017-09-28 Thread ayan guha
Hi, Can you kindly explain how Spark uses parallelism for a bigger (say 1 GB) text file? Does it use InputFormat to create multiple splits and create 1 partition per split? Also, in the case of S3 or NFS, how does the input split work? I understand that for HDFS, files are already pre-split, so Spark can use

How to run MLlib's word2vec in CBOW mode?

2017-09-28 Thread pun
Hello, My understanding is that word2vec can be run in two modes: continuous bag-of-words (CBOW) (order of words does not matter) and continuous skip-gram (order of words matters). I would like to run the *CBOW* implementation from Spark's MLlib, but it is not clear to me from the documentation and
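For reference, a minimal sketch of the spark.ml Word2Vec API (adapted from the standard documentation example; `spark` is the SparkSession). Per the reply above, it trains skip-gram vectors only, with no CBOW switch exposed:

```scala
import org.apache.spark.ml.feature.Word2Vec

// Toy data: one sentence per row, each row a sequence of tokens.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)

val model = word2Vec.fit(documentDF)
model.transform(documentDF).show(false)
```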

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> no matter what you do and how many nodes you start, in case you have a > single text file, it will not use parallelism. > This is not true, unless the file is small or is gzipped (gzipped files cannot be split).

This code makes the job run 2x as long. Is there a way to improve it?

2017-09-28 Thread Noppanit Charassinvichai
We're trying to filter out some records of the output into another table in ORC, and the job takes twice as long. Not sure if there's a better way to do this? Here's the code: jsonRows.foreachRDD(r => { val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(r) val cnsDf =
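Hard to say without the full snippet, but a common cause of the 2x cost is that the main write and the filtered write each re-read and re-parse the JSON. A hedged sketch of caching the parsed frame once before fanning out to the two ORC outputs (names visible in the excerpt are reused; the targets and predicate are assumptions):

```scala
import org.apache.spark.sql.functions.col

jsonRows.foreachRDD { r =>
  // Parse the batch once and cache it, so the second (filtered) write
  // does not recompute the JSON parsing.
  val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(r).cache()

  jsonDf.write.mode("append").orc("s3://my-bucket/main-table")          // placeholder target
  jsonDf.filter(col("status") === "error")                              // placeholder predicate
    .write.mode("append").orc("s3://my-bucket/errors-table")            // placeholder target

  jsonDf.unpersist()
}
```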

Massive fetch fails, io errors in TransportRequestHandler

2017-09-28 Thread Ilya Karpov
Hi, I see strange behaviour in my job and can’t understand what is wrong: the stage that uses shuffle data as its input fails a number of times because of org.apache.spark.shuffle.FetchFailedException, seen in the Spark UI: FetchFailed(BlockManagerId(8, hostname, 11431, None), shuffleId=1,

Re: Spark job taking 10s to allocate executors and memory before submitting job

2017-09-28 Thread Stéphane Verlet
Sounds like such a small job; if you are running it on a cluster, have you considered simply running it locally (master = local)? On Wed, Sep 27, 2017 at 7:06 AM, navneet sharma wrote: > Hi, > > I am running a spark job taking 18s in total, of which 8 seconds are for the actual >

Re: More instances = slower Spark job

2017-09-28 Thread Gourav Sengupta
Hi, I will be very surprised if someone tells me that a 1 GB CSV text file is automatically split and read by multiple executors in Spark. It does not matter whether it is in HDFS, S3 or any other system. Now, if someone tells me that in case I have a smaller CSV file of 100 MB size and that

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> Can you kindly explain how Spark uses parallelism for bigger (say 1GB) > text file? Does it use InputFormat do create multiple splits and creates 1 > partition per split? Also, in case of S3 or NFS, how does the input split > work? I understand for HDFS files are already pre-split so Spark can

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
On Thu, Sep 28, 2017 at 7:23 AM, Gourav Sengupta wrote: > > I will be very surprised if someone tells me that a 1 GB CSV text file is > automatically split and read by multiple executors in SPARK. It does not > matter whether it stays in HDFS, S3 or any other system. >

Re: Loading objects only once

2017-09-28 Thread Eike von Seggern
Hello, maybe broadcast can help you here. [1] You can load the model once on the driver and then broadcast it to the workers with `bc_model = sc.broadcast(model)`. You can access the model in the map function with `bc_model.value`. Best Eike [1]
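In Scala form, that suggestion might look like the sketch below (`ModelLoader` is a hypothetical, serializable stand-in for the real model class; the input data is made up and `sc` is the SparkContext):

```scala
// Hypothetical stand-in for the real model; it must be serializable to be broadcast.
class ModelLoader(path: String) extends Serializable {
  def score(input: String): Double = input.length.toDouble   // placeholder logic
}

val model   = new ModelLoader("/path/to/model-file")   // loaded once, on the driver
val bcModel = sc.broadcast(model)

val scored = sc.parallelize(Seq("alpha", "beta", "gamma")).map { s =>
  bcModel.value.score(s)   // tasks on the same executor share the one broadcast copy
}
scored.collect()
```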

More instances = slower Spark job

2017-09-28 Thread Jeroen Miller
Hello, I am experiencing a disappointing performance issue with my Spark jobs as I scale up the number of instances. The task is trivial: I am loading large (compressed) text files from S3, filtering out lines that do not match a regex, counting the number of remaining lines and saving the

RE: More instances = slower Spark job

2017-09-28 Thread Tejeshwar J1
Hi Miller, Try the following: 1. *coalesce(numberOfPartitions)* to reduce the number of partitions in order to avoid idle cores. 2. Try reducing executor memory as you increase the number of executors. 3. Try performing GC or changing naïve Java serialization to *kryo* serialization. Thanks,
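For reference, hedged sketches of suggestions 1 and 3 (the partition count is a placeholder to tune against the total executor cores, and `rdd` stands for whatever RDD is being processed):

```scala
import org.apache.spark.SparkConf

// 3. Swap default Java serialization for Kryo.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// 1. Reduce the partition count without a full shuffle, so tiny partitions
//    don't leave cores idle; pick a number close to the total executor cores.
val compacted = rdd.coalesce(24)
```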