Whatever you write in bolts would be the logic you want to apply on your
events. In Spark, that logic would be coded in map() or similar such
transformations and/or actions. Spark doesn't enforce a structure for
capturing your processing logic like Storm does.
Regards
Sab
Probably overloading the
Nick is right. I too have implemented it this way and it works just fine. In
my case, there can be even more products. You simply broadcast blocks of
products to userFeatures.mapPartitions() and BLAS multiply in there to get
recommendations. In my case 10K products form one block. Note that you
would
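The blocking approach above can be sketched in plain Python (no Spark or BLAS; the user IDs, product IDs, and feature vectors below are made up for illustration):

```python
import heapq

def score_block(user_features, product_block, top_n=2):
    """Score every product in one broadcast block for each user by dot
    product, keeping only the top-N (product_id, score) pairs per user."""
    recs = {}
    for user_id, u in user_features:
        scored = [(sum(a * b for a, b in zip(u, p)), pid)
                  for pid, p in product_block]
        recs[user_id] = [(pid, s) for s, pid in heapq.nlargest(top_n, scored)]
    return recs

users = [("u1", [1.0, 0.0]), ("u2", [0.0, 1.0])]      # user feature vectors
block = [("p1", [2.0, 0.0]), ("p2", [0.0, 3.0]), ("p3", [1.0, 1.0])]
print(score_block(users, block, top_n=1))
# {'u1': [('p1', 2.0)], 'u2': [('p2', 3.0)]}
```

In the actual job, `product_block` would be a broadcast variable and the scoring loop would run inside `userFeatures.mapPartitions()`, with BLAS doing the multiply over a whole block at a time.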
64GB in parquet could be many billions of rows because of the columnar
compression. And count distinct by itself is an expensive operation. This
is not just on Spark, even on Presto/Impala, you would see performance dip
with count distincts. And the cluster is not that powerful either.
The one iss
Did you try increasing the perm gen for the driver?
Regards
Sab
On 24-Jun-2015 4:40 pm, "Anders Arpteg" wrote:
> When reading large (and many) datasets with the Spark 1.4.0 DataFrames
> parquet reader (the org.apache.spark.sql.parquet format), the following
> exceptions are thrown:
>
> Exception
Are you running on top of YARN? Plus pls provide your infrastructure
details.
Regards
Sab
On 28-Jun-2015 8:47 am, "Ayman Farahat"
wrote:
> Hello;
> I tried to adjust the number of blocks by repartitioning the input.
> Here is how I do it (I am partitioning by users):
>
> tot = newrdd.map(lambd
Are you running on top of YARN? Plus pls provide your infrastructure
details.
Regards
Sab
On 28-Jun-2015 9:20 am, "Sabarish Sasidharan" <
sabarish.sasidha...@manthan.com> wrote:
> Are you running on top of YARN? Plus pls provide your infrastructure
> details.
>
> Rega
Sent from my iPhone
>
> On Jun 27, 2015, at 8:50 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
> Are you running on top of YARN? Plus pls provide your infrastructure
> details.
>
> Regards
> Sab
> On 28-Jun-2015 8:47 am, "Ayman F
You can use S3's listKeys API and do a diff between consecutive listKeys to
identify what's new.
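A minimal sketch of that diff, using plain Python lists in place of the listings an S3 client would return (the key names are made up):

```python
def new_keys(previous, current):
    """Return keys present in the current listing but not the previous one."""
    return sorted(set(current) - set(previous))

prev = ["logs/2016-03-08.gz", "logs/2016-03-09.gz"]
curr = ["logs/2016-03-08.gz", "logs/2016-03-09.gz", "logs/2016-03-10.gz"]
print(new_keys(prev, curr))  # ['logs/2016-03-10.gz']
```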
Are there multiple files in each zip? Single-file archives are processed
just like text, as long as they use one of the supported compression formats.
Regards
Sab
On Wed, Mar 9, 2016 at 10:33 AM, Benjami
I believe you need to co-locate your Zeppelin on the same node where Spark
is installed. You need to specify SPARK_HOME. The master I used was
YARN.
Zeppelin exposes a notebook interface. A notebook can have many paragraphs.
You run the paragraphs. You can mix multiple contexts in the same not
Which version of Spark are you using? The configuration varies by version.
Regards
Sab
On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph
wrote:
> Hi All,
>
> A Hive Join query which runs fine and faster in MapReduce takes lot of
> time with Spark and finally fails with OOM.
>
> *Query: hivejoin.
using MR
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpr
caching anything. You could simply swap the
fractions in your case.
Regards
Sab
On Mon, Mar 14, 2016 at 2:20 PM, Prabhu Joseph
wrote:
> It is a Spark-SQL and the version used is Spark-1.2.1.
>
> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com
performance.
Regards
Sab
On 14-Mar-2016 2:33 pm, "Prabhu Joseph" wrote:
> The issue is the query hits OOM on a Stage when reading Shuffle Output
> from previous stage. How come increasing shuffle memory helps to avoid OOM.
>
> On Mon, Mar 14, 2016 at 2:28 PM, Sabarish S
It will compress only rdds with serialization enabled in the persistence
mode. So you could skip _SER modes for your other rdds. Not perfect but
something.
On 15-Mar-2016 4:33 pm, "Nirav Patel" wrote:
> Hi,
>
> I see that there's following spark config to compress an RDD. My guess is
> it will c
You have a slash before the bucket name. It should be an @ instead.
Regards
Sab
On 15-Mar-2016 4:03 pm, "Yasemin Kaya" wrote:
> Hi,
>
> I am using Spark 1.6.0 standalone and I want to read a txt file from S3
> bucket named yasemindeneme and my file name is deneme.txt. But I am getting
> this error. Here is
; Amazon strongly suggests that we do not use. Please use roles and you will
> not have to worry about security.
>
> Regards,
> Gourav Sengupta
>
> On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <
> sabarish@gmail.com> wrote:
>
>> You have a slash before
Can't you just reduce the amount of data you insert by applying a filter so
that only a small set of idpartitions is selected? You could have multiple
such inserts to cover all idpartitions. Does that help?
Regards
Sab
On 22 May 2016 1:11 pm, "swetha kasireddy"
wrote:
> I am looking at ORC. I in
If all you want to do is to load data into Hive, you don't need to use
Spark.
For subsequent query performance you would want to convert to ORC or
Parquet when loading into Hive.
Regards
Sab
On 16-Dec-2015 7:34 am, "Divya Gehlot" wrote:
> Hi,
> I am a newbie to Spark and I am exploring option a
collect() will bring everything to the driver and is costly. Instead of using
collect() + parallelize, you could use rdd1.checkpoint() along with a more
efficient action like rdd1.count(). This you can do within the for loop.
Hopefully you are using the Kryo serializer already.
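The cost difference can be illustrated outside Spark: an action like count() touches every row but holds only a counter, while collect() would materialize everything in the driver. A rough plain-Python analogy:

```python
def count(it):
    """A cheap action: consumes the iterator but keeps only a counter,
    unlike collect(), which would materialize every row in the driver."""
    n = 0
    for _ in it:
        n += 1
    return n

# A generator stands in for an RDD's rows; count() never builds a list.
rows = (i * i for i in range(1000))
print(count(rows))  # 1000
```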
Regards
Sab
On Mon, D
You can just repartition by the id, if the final objective is to have all
data for the same key in the same partition.
Regards
Sab
On Wed, Jan 6, 2016 at 11:02 AM, Gavin Yue wrote:
> I found that in 1.6 dataframe could do repartition.
>
> Should I still need to do orderby first or I just have t
If you are on EMR, these can go into your hdfs site config. And will work
with Spark on YARN by default.
Regards
Sab
On 11-Jan-2016 5:16 pm, "Krishna Rao" wrote:
> Hi all,
>
> Is there a method for reading from s3 without having to hard-code keys?
> The only 2 ways I've found both require this:
One option could be to store them as blobs in a cache like Redis and then
read + broadcast them from the driver. Or you could store them in HDFS and
read + broadcast from the driver.
Regards
Sab
On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg wrote:
> We have a bunch of Spark jobs deployed a
How much RAM are you giving to the driver? 17000 items being collected
shouldn't fail unless your driver memory is too low.
Regards
Sab
On 13-Jan-2016 6:14 am, "Ritu Raj Tiwari"
wrote:
> Folks:
> We are running into a problem where FPGrowth seems to choke on data sets
> that we think are not too
You could generate as many duplicates as needed, each with a tag/sequence. And
then use a custom partitioner that uses that tag/sequence in addition to the
key to do the partitioning.
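A plain-Python sketch of the salting: replicate each record with a sequence tag, then partition on the (key, tag) pair so one hot key can spread over several partitions. The record contents and copy count are made up:

```python
import itertools

def salt(records, copies):
    """Duplicate each (key, value) record with a sequence tag."""
    return [((key, tag), value)
            for (key, value), tag in itertools.product(records, range(copies))]

def partition_of(salted_key, num_partitions):
    """What a custom Partitioner's getPartition would compute."""
    key, tag = salted_key
    return hash((key, tag)) % num_partitions

data = [("hot", 1)]
salted = salt(data, copies=3)
print(sorted(k for k, _ in salted))
# [('hot', 0), ('hot', 1), ('hot', 2)]
```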
Regards
Sab
On 12-Jan-2016 12:21 am, "Daniel Imberman"
wrote:
> Hi all,
>
> I'm looking for a way to efficiently partition an RD
The most efficient way to determine the number of columns would be to do a
take(1) and split the line in the driver.
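A sketch of the idea: sample one line (what take(1) would return) and split it in the driver. Plain Python, with an assumed comma separator:

```python
def num_columns(first_line, sep=","):
    """Infer the column count from a single sampled line."""
    return len(first_line.split(sep))

sample = "id,name,age,city"   # the one line returned by take(1)
print(num_columns(sample))  # 4
```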
Regards
Sab
On 19-Jan-2016 8:48 pm, "Richard Siebeling" wrote:
> Hi,
>
> what is the most efficient way to split columns and know how many columns
> are created.
>
> Here is the current RDD
>
You could always construct a new MatrixFactorizationModel with your
filtered set of user features and product features. I believe it's just a
stateless wrapper around the actual rdds.
Regards
Sab
On Wed, Feb 3, 2016 at 5:28 AM, Roberto Pagliari
wrote:
> When using ALS, is it possible to use reco
The Hive context can be used instead of sql context even when you are
accessing data from non-Hive sources like MySQL or Postgres, for example. It
has better SQL support than the SQLContext as it uses the HiveQL parser.
Regards
Sab
On 15-Feb-2016 3:07 am, "Mich Talebzadeh" wrote:
> Thanks.
>
>
>
> I
Yes you can look at using the capacity scheduler or the fair scheduler with
YARN. Both allow using full cluster when idle. And both allow considering
cpu plus memory when allocating resources which is sort of necessary with
Spark.
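As a sketch, a YARN fair-scheduler allocations file can opt into DRF (Dominant Resource Fairness), which weighs both CPU and memory when allocating; the queue name and weight below are made up:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- drf considers both vcores and memory, which suits Spark workloads -->
  <queue name="spark">
    <weight>2.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>
```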
Regards
Sab
On 13-Feb-2016 10:11 pm, "Eugene Morozov"
wrote:
> Hi
Make sure you are using s3 bucket in same region. Also I would access my
bucket this way s3n://bucketname/foldername.
You can test privileges using the s3 cmd line client.
Also, if you are using instance profiles you don't need to specify access
and secret keys. No harm in specifying though.
Reg
Looks like your executors are running out of memory. YARN is not kicking
them out. Just increase the executor memory. Also consider increasing
the parallelism, i.e. the number of partitions.
Regards
Sab
On 11-Feb-2016 5:46 am, "Nirav Patel" wrote:
> In Yarn we have following settings enabled so
I believe you will gain more understanding if you look at or use
mapPartitions()
Regards
Sab
On 15-Feb-2016 8:38 am, "Christopher Brady"
wrote:
> I tried it without the cache, but it didn't change anything. The reason
> for the cache is that other actions will be performed on this RDD, even
> th
When running in YARN, you can use the YARN Resource Manager UI to get to
the ApplicationMaster url, irrespective of client or cluster mode.
Regards
Sab
On 15-Feb-2016 10:10 am, "Divya Gehlot" wrote:
> Hi,
> I have Hortonworks 2.3.4 cluster on EC2 and Have spark jobs as scala files
> .
> I am bit
You can setup SSH tunneling.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
Regards
Sab
On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot
wrote:
> Hi,
> I have hadoop cluster set up in EC2.
> I am unable to view application logs in Web UI as its taking intern
I don't think that's a good idea, even if it weren't in Spark. I am trying
to understand the benefits you gain by separating.
I would rather use a Composite pattern approach wherein adding to the
composite cascades the additive operations to the children. Thereby your
foreach code doesn't have to k
EMR does cost more than vanilla EC2. Using spark-ec2 can result in savings
with large clusters, though that is not everybody's cup of tea.
Regards
Sab
On 19-Feb-2016 7:55 pm, "Daniel Siegmann"
wrote:
> With EMR supporting Spark, I don't see much reason to use the spark-ec2
> script unless it is
Your logs are getting archived in your logs bucket in S3.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
Regards
Sab
On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR
wrote:
> Hi
>
> In am using an EMR cluster for running my spark jobs, but after the jo
When using SQL, your full query, including the joins, was executed in
Hive (or the RDBMS) and only the results were brought into the Spark cluster. In
the FP case, the data for the 3 tables is first pulled into the Spark
cluster and then the join is executed.
Thus the time difference.
It's not immedia
And yes, storage grows on demand. No issues with that.
Regards
Sab
On 24-Feb-2016 6:57 am, "Andy Davidson"
wrote:
> Currently our stream apps write results to hdfs. We are running into
> problems with HDFS becoming corrupted and running out of space. It seems
> like a better solution might be to
Writing to S3 is over the network. So will obviously be slower than local
disk. That said, within AWS the network is pretty fast. Still you might
want to write to S3 only after a certain threshold in data is reached, so
that it's efficient. You might also want to use the DirectOutputCommitter
as it
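The thresholding idea can be sketched independent of Spark or S3: buffer records and flush only once enough bytes accumulate, so each write (an S3 PUT, say) carries a worthwhile chunk. The 10-byte threshold below is artificially small for illustration:

```python
class ThresholdWriter:
    """Buffer records and flush in large chunks."""
    def __init__(self, flush_fn, threshold_bytes):
        self.flush_fn = flush_fn       # e.g. an S3 PUT in the real job
        self.threshold = threshold_bytes
        self.buf = []
        self.size = 0

    def write(self, record):
        self.buf.append(record)
        self.size += len(record)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        if self.buf:
            self.flush_fn("".join(self.buf))
            self.buf, self.size = [], 0

chunks = []
w = ThresholdWriter(chunks.append, threshold_bytes=10)
for rec in ["aaaa", "bbbb", "cccc"]:
    w.write(rec)
w.flush()
print(chunks)  # ['aaaabbbbcccc'] -- one big write instead of three small ones
```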
Yes
Regards
Sab
On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote:
> are you saying that HiveContext.sql(...) runs on hive, and not on spark
> sql?
>
> On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> When
iveContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> sys.exit
>
> *Results*
>
> Started at [24/02/2016 08:52:27.27]
> res1: org.apache.spark.sql.DataFrame = [result: string]
> s: org.apache.spa
"SALES")).take(5).foreach(println)
> println ("\nFinished at"); HiveContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> sys.exit
>
> *Results*
>
> Started at [24/02/2016 08:52:27.27]
hive instance to run.
> On Feb 24, 2016 19:15, "Sabarish Sasidharan" <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Yes
>>
>> Regards
>> Sab
>> On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote:
>>
>>> are you saying that H
There is no execution plan for FP. An execution plan exists only for SQL.
Regards
Sab
On 24-Feb-2016 2:46 pm, "Ashok Kumar" wrote:
> Gurus,
>
> Is there anything like explain in Spark to see the execution plan in
> functional programming?
>
> warm regards
>
I believe the ALS algo expects the ratings to be aggregated (A). I don't
see why you have to use decimals for rating.
Regards
Sab
On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada wrote:
> Hello.
>
> I just started working on CF in MLlib.
> I am using trainImplicit because I only have implicit r
Like Robin said, pls explore Pregel. You could do it without Pregel but it
might be laborious. I have a simple outline below. You will need more
iterations if the number of levels is higher.
a-b
b-c
c-d
b-e
e-f
f-c
flatmaptopair
a -> (a-b)
b -> (a-b)
b -> (b-c)
c -> (b-c)
c -> (c-d)
d -> (c-d)
b
I don't have a proper answer to this. But to circumvent if you have 2
independent Spark jobs, you could update one when the other is serving
reads. But it's still not scalable for incessant updates.
Regards
Sab
On 25-Feb-2016 7:19 pm, "Udbhav Agarwal" wrote:
> Hi,
>
> I am using graphx. I am add
I have tested up to 3 billion. ALS scales, you just need to scale your
cluster accordingly. More than building the model, it's getting the final
recommendations that won't scale as nicely, especially when number of
products is huge. This is the case when you are generating recommendations
in a batch
This is because Hadoop writables are being reused. Just map it to some
custom type and then do further operations including cache() on it.
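The pitfall can be reproduced in plain Python: a reader that mutates and re-yields a single object (as Hadoop input formats do with Writables) makes naive caching see only the last row. The `Record` class here is a stand-in for a reused Writable:

```python
class Record:
    """Stands in for a reused Hadoop Writable: the reader mutates one
    instance in place for every row."""
    def __init__(self):
        self.value = None

def read_rows(values):
    rec = Record()            # a single instance, reused for every row
    for v in values:
        rec.value = v
        yield rec

# Naive caching keeps N references to the SAME object:
cached = list(read_rows([1, 2, 3]))
print([r.value for r in cached])  # [3, 3, 3] -- every entry sees the last row

# Fix: map each reused record to a fresh value before caching.
cached = [r.value for r in read_rows([1, 2, 3])]
print(cached)  # [1, 2, 3]
```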
Regards
Sab
On 27-Feb-2016 9:11 am, "Yan Yang" wrote:
> Hi
>
> I am pretty new to Spark, and after experimentation on our pipelines. I
> ran into this weird
Zeppelin?
Regards
Sab
On 01-Mar-2016 12:27 am, "Mich Talebzadeh"
wrote:
> Hi,
>
> Is there such thing as Spark for client much like RDBMS client that have
> cut down version of their big brother useful for client connectivity but
> cannot be used as server.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
BeanInfo?
On 01-Mar-2016 6:25 am, "Steve Lewis" wrote:
> I have a relatively complex Java object that I would like to use in a
> dataset
>
> if I say
>
> Encoder evidence = Encoders.kryo(MyType.class);
>
> JavaRDD rddMyType= generateRDD(); // some code
>
> Dataset datasetMyType= sqlCtx.createDa
If all you want is Spark standalone then it's as simple as installing the
binaries and calling spark-submit with your main class. I would advise
against running Hadoop on Windows, it's a bit of trouble. But yes you
can do it if you want to.
Regards
Sab
On 29-Feb-2016 6:58 pm, "ga
Have you tried spark.*hadoop*.*validateOutputSpecs*?
On 01-Mar-2016 9:43 pm, "Peter Halliday" wrote:
> http://pastebin.com/vbbFzyzb
>
> The problem seems to be to be two fold. First, the ParquetFileWriter in
> Hadoop allows for an overwrite flag that Spark doesn’t allow to be set.
> The second i
You can point to your custom HADOOP_CONF_DIR in your spark-env.sh
Regards
Sab
On 01-Oct-2015 5:22 pm, "Vinoth Sankar" wrote:
> Hi,
>
> I'm new to Spark. For my application I need to overwrite Hadoop
> configurations (Can't change Configurations in Hadoop as it might affect my
> regular HDFS), so
How many rows are you joining? How many rows in the output?
Regards
Sab
On 24-Oct-2015 2:32 am, "pratik khadloya" wrote:
> Actually the groupBy is not taking a lot of time.
> The join that i do later takes the most (95 %) amount of time.
> Also, the grouping i am doing is based on the DataFrame
Did you try sc.binaryFiles(), which gives you an RDD of
PortableDataStream that wraps around the underlying bytes?
On Tue, Oct 27, 2015 at 10:23 PM, Balachandar R.A. wrote:
> Hello,
>
>
> I have developed a hadoop based solution that process a binary file. This
> uses classic hadoop MR techni
If you are writing to S3, also make sure that you are using the direct
output committer. I don't have streaming jobs but it helps in my machine
learning jobs. Also, though more partitions help in processing faster, they
do slow down writes to S3. So you might want to coalesce before writing to
S3.
Hi
You cannot use PairRDD but you can use JavaRDD. So in your case, to
make it work with least change, you would call run(transactions.values()).
Each MLLib implementation has its own data structure typically and you
would have to convert from your data structure before you invoke. For ex if
you
Theoretically the executor is a long lived container. So you could use some
simple caching library or a simple Singleton to cache the data in your
executors, once they load it from mysql. But note that with lots of
executors you might choke your mysql.
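A process-level singleton is enough for that executor-side cache; sketched in plain Python with a hypothetical `load_from_mysql` loader (the data and names are made up):

```python
_cache = None

def get_reference_data(load_fn):
    """Load once per process (executor) and reuse; repeated calls from
    tasks in the same process hit the cached copy, not the database."""
    global _cache
    if _cache is None:
        _cache = load_fn()
    return _cache

loads = []
def load_from_mysql():            # hypothetical loader, counts its calls
    loads.append(1)
    return {"country_codes": ["IN", "US"]}

get_reference_data(load_from_mysql)
get_reference_data(load_from_mysql)
print(len(loads))  # 1 -- only one round trip despite two calls
```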
Regards
Sab
On 05-Nov-2015 7:03 pm, "Kay-Uwe
Qubole is one option where you can use spots and get a couple other
benefits. We use Qubole at Manthan for our Spark workloads.
For ensuring all the nodes are ready, you could use
spark.scheduler.minRegisteredResourcesRatio config property to ensure the execution
doesn't start till the requisite containers h
Qubole uses yarn.
Regards
Sab
On 06-Nov-2015 8:31 am, "Jerry Lam" wrote:
> Does Qubole use Yarn or Mesos for resource management?
>
> Sent from my iPhone
>
> > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
> >
> > Qubole
>
I have seen some failures in our workloads with Kryo, one I remember is a
scenario with very large arrays. We could not get Kryo to work despite
using the different configuration properties. Switching to java serde was
what worked.
Regards
Sab
On Tue, Nov 10, 2015 at 11:43 AM, Hitoshi Ozawa
wrot
That may not be an issue if the app using the models runs by itself (not
bundled into an existing app), which may actually be the right way to
design it considering separation of concerns.
Regards
Sab
On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai wrote:
> This will bring the whole dependencies of sp
We have done this by blocking but without using BlockMatrix. We used our
own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What
is the size of your block? How much memory are you giving to the executors?
I assume you are running on YARN, if so you would want to make sure your
ya
The reserved cores are to prevent starvation, so that user B can run jobs
when user A's job is already running and using almost all of the cluster.
You can change your scheduler configuration to use more cores.
Regards
Sab
On 13-Nov-2015 6:56 pm, "Parin Choganwala" wrote:
> EMR 4.1.0 + Spark 1.5.
scale up with larger matrices and more nodes. I’d be
interested to hear how large a matrix multiplication you managed to do?
Is there an alternative you’d recommend for calculating similarity over a
large dataset?
Thanks,
Eilidh
On 13 Nov 2015, at 09:55, Sabarish Sasidharan <
sabarish
If it is metadata, why would we not cache it before we perform the join?
Regards
Sab
On 13-Nov-2015 10:27 pm, "Eran Medan" wrote:
> Hi
> I'm looking for some benchmarks on joining data frames where most of the
> data is in HDFS (e.g. in parquet) and some "reference" or "metadata" is
> still in RD
It would be easier if you could show some code.
Regards
Sab
On 14-Nov-2015 6:26 am, "Walrus theCat" wrote:
> Hi,
>
> I have an RDD which crashes the driver when being collected. I want to
> send the data on its partitions out to S3 without bringing it back to the
> driver. I try calling rdd.for
How are you writing it out? Can you post some code?
Regards
Sab
On 14-Nov-2015 5:21 am, "Rok Roskar" wrote:
> I'm not sure what you mean? I didn't do anything specifically to partition
> the columns
> On Nov 14, 2015 00:38, "Davies Liu" wrote:
>
>> Do you have partitioned columns?
>>
>> On Thu,
You are probably trying to access the spring context from the executors
after initializing it at the driver. And running into serialization issues.
You could instead use mapPartitions() and initialize the spring context
from within that.
That said I don't think that will solve all of your issues
> public void call(String t) throws Exception {
> springBean.someAPI(t); // here we will have db transaction as
> well.
> }
> });}
>
> Thanks,
> Netai
>
> On Sat, Nov 14, 2015 at 9:49 PM, Sabarish Sasidharan <
> sabarish.sasidha.
You can try increasing the number of partitions before writing it out.
Regards
Sab
On Mon, Nov 16, 2015 at 3:46 PM, Zhang, Jingyu
wrote:
> I am using spark-csv to save files in s3, it shown Size exceeds. Please let
> me know how to fix it. Thanks.
>
> df.write()
> .format("com.databricks.s
Spark will use the number of executors you specify in spark-submit. Are you
saying that Spark is not able to use more executors after you modify it in
spark-submit? Are you using dynamic allocation?
Regards
Sab
On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan <
dineshranganat...@gmail.com> wrot
You can write your own custom partitioner to achieve this
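The getPartition logic of such a partitioner can be sketched in plain Python, assuming integer keys 0..29 and 30 partitions (the key scheme is made up for illustration):

```python
def identity_partitioner(key, num_partitions):
    """Route each integer key to its own partition (what a custom Spark
    Partitioner's getPartition would return)."""
    return key % num_partitions

records = [(k, "value-%d" % k) for k in range(30)]
partitions = [[] for _ in range(30)]
for key, value in records:
    partitions[identity_partitioner(key, 30)].append((key, value))

print(max(len(p) for p in partitions))  # 1 -- exactly one record per partition
```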
Regards
Sab
On 17-Nov-2015 1:11 am, "prateek arora" wrote:
> Hi
>
> I have a RDD with 30 record ( Key/value pair ) and running 30 executor . i
> want to reparation this RDD in to 30 partition so every partition get one
> record and assig
You can also do rdd.toJavaRDD(). Pls check the API docs
Regards
Sab
On 18-Nov-2015 3:12 am, "Bryan Cutler" wrote:
> Hi Ivan,
>
> Since Spark 1.4.1 there is a Java-friendly function in LDAModel to get the
> topic distributions called javaTopicDistributions() that returns a
> JavaPairRDD. If you
The stack trace is clear enough:
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
protocol method: #method(reply-code=406,
reply-text=PRECONDITION_FAILED - inequivalent arg 'durable' for queue
'hello1' in vhost '/': received 'true' but current is 'false', class-id=50,
method-
Those are empty partitions. I don't see the number of partitions specified
in code. That then implies the default parallelism config is being used and
is set to a very high number: the sum of empty and non-empty files.
Regards
Sab
On 21-Nov-2015 11:59 pm, "Andy Davidson"
wrote:
> I start working o
esce() and that coalesce() is taking a lot of
> time. Any suggestions how to choose the file number of files.
>
> Kind regards
>
> Andy
>
>
> From: Xiao Li
> Date: Monday, November 23, 2015 at 12:21 PM
> To: Andrew Davidson
> Cc: Sabarish Sasidharan , "use
If YARN has only 50 cores then it can support at most 49 executors plus 1
application master for the driver.
Regards
Sab
On 24-Nov-2015 1:58 pm, "谢廷稳" wrote:
> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>
> I have ran it again, the command to run it is:
> ./bin/spark-submit --class org.apache.spark.
First of all, select * is not a useful SQL to evaluate. Very rarely would a
user require all 362K records for visual analysis.
Second, collect() forces movement of all data from executors to the driver.
Instead write it out to some other table or to HDFS.
Also Spark is more beneficial when you ha
You could use the number of input files to determine the number of output
partitions. This assumes your input file sizes are deterministic.
Else, you could also persist the RDD and then determine its size using the
apis.
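The persisted-size approach boils down to simple arithmetic: pick a partition count so each output file lands near a target size. A sketch, with 128 MB assumed as the target:

```python
def target_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Pick an output partition count so each file lands near a target size."""
    return max(1, -(-total_bytes // target_partition_bytes))  # ceiling division

print(target_partitions(1 * 1024**3))   # 8 -- eight ~128MB files for 1GB
print(target_partitions(10 * 1024**2))  # 1 -- small inputs stay in one file
```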
Regards
Sab
On 26-Nov-2015 11:13 pm, "Nezih Yigitbasi"
wrote:
> Hi Spark
#2: if using hdfs it's on the disks. You can use the HDFS command line to
browse your data. And then use s3distcp or simply distcp to copy data from
hdfs to S3. Or even use hdfs get commands to copy to local disk and then
use S3 cli to copy to s3
#3. Cost of accessing data in S3 from Ec2 nodes, t
It is still in memory for future rdd transformations and actions. What you
get in the driver is a copy of the data.
Regards
Sab
On Tue, Aug 18, 2015 at 12:02 PM, praveen S wrote:
> When I do an rdd.collect().. The data moves back to driver Or is still
> held in memory across the executors?
>
--
MEMORY_ONLY will fail if there is not enough memory but MEMORY_AND_DISK
will spill to disk
Regards
Sab
On Tue, Aug 18, 2015 at 12:45 PM, Harsha HN <99harsha.h@gmail.com>
wrote:
> Hello Sparkers,
>
> I would like to understand difference btw these Storage levels for a RDD
> portion that doesn
For your jsons, can you tell us what is your benchmark when running on a
single machine using just plain Java (without Spark and Spark sql)?
Regards
Sab
On 28-Aug-2015 7:29 am, "Gavin Yue" wrote:
> Hey
>
> I am using the Json4s-Jackson parser coming with spark and parsing roughly
> 80m records w
mapper.registerModule(DefaultScalaModule)
> val s = mapper.readTree(line)
> }
> println(java.lang.System.currentTimeMillis() - begin)
>
>
> One Json record contains two fileds : ID and List[Event].
>
> I am guessing put all the events into List would take t
How many executors are you using when using Spark SQL?
On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> I see that you are not reusing the same mapper instance in the Scala
> snippet.
>
> Regards
> Sab
>
> On Fri, Aug
Hi Pulasthi
You can always use JavaRDD.rdd() to get the scala rdd. So in your case,
new BlockMatrix(rdd.rdd(), 2, 2)
should work.
Regards
Sab
On Tue, Sep 22, 2015 at 10:50 PM, Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:
> Hi Yanbo,
>
> Thanks for the reply. I thought i might
You can't obtain that from the model. But you can always ask the model to
predict the cluster center for your vectors by calling predict().
Regards
Sab
On Wed, Sep 23, 2015 at 7:24 PM, Tapan Sharma
wrote:
> Hi All,
>
> In the KMeans example provided under mllib, it traverse the outcome of
> KMe
rix>>
>>
>> As Yanbo has mentioned, I think a Java friendly constructor is still in
>> demand.
>>
>> 2015-09-23 13:14 GMT+08:00 Pulasthi Supun Wickramasinghe
>> :
>> > Hi Sabarish
>> >
>> > Thanks, that would indeed solve my problem
A little caution is needed, as one executor per node may not always be ideal,
especially when your nodes have lots of RAM. But yes, using a smaller number
of executors has benefits like more efficient broadcasts.
Regards
Sab
On 24-Sep-2015 2:57 pm, "Adrian Tanase" wrote:
> RE: # because I already have a bun
By java class objects if you mean your custom Java objects, yes of course.
That will work.
Regards
Sab
On 24-Sep-2015 3:36 pm, "Ramkumar V" wrote:
> Hi,
>
> I want to know whether grouping by java class objects is possible or not
> in java Spark.
>
> I have Tuple2< JavaObject, JavaObject>. i wan
Pls be aware that Accumulators involve communication back with the driver
and may not be efficient. I think OP wants some way to extract the stats
from the sql plan if it is being stored in some internal data structure
Regards
Sab
On 5 Nov 2016 9:42 p.m., "Deepak Sharma" wrote:
> Hi Rohit
> You
@Sachin
>>is not elastic. You need to anticipate before hand on volume of data you
will have. Very difficult to add and reduce topic partitions later on.
Why do you say so Sachin? Kafka Streams will readjust once we add more
partitions to the Kafka topic. And when we add more machines, rebalancing
@Sachin
>>The partition key is very important if you need to run multiple instances
of streams application and certain instance processing certain partitions
only.
Again, depending on partition key is optional. It's actually a feature
enabler, so we can use local state stores to improve throughput
I am trying to compute column similarities on a 30x1000 RowMatrix of
DenseVectors. The size of the input RDD is 3.1MB and its all in one
partition. I am running on a single node of 15G and giving the driver 1G
and the executor 9G. This is on a single node hadoop. In the first attempt
the BlockManag
Sorry, I actually meant 30 x 1 matrix (missed a 0)
Regards
Sab
Hi Reza
I see that ((int, int), double) pairs are generated for any combination
that meets the criteria controlled by the threshold. But assuming a simple
1x10K matrix that means I would need at least 12GB memory per executor for
the flat map just for these pairs excluding any other overhead. Is
roups of similar points, consider
> using K-means - it will get you clusters of similar rows with euclidean
> distance.
>
> Best,
> Reza
>
>
> On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>>
>> Hi Rez
Unlike a map() wherein your task is acting on a row at a time, with
mapPartitions(), the task is passed the entire content of the partition in
an iterator. You can then return back another iterator as the output. I
don't do Scala, but from what I understand from your code snippet... The
iterator x
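The contrast can be sketched in plain Python: a mapPartitions-style function receives the whole partition as an iterator and returns another iterator, so per-partition setup (opening a connection, building a parser) runs once, not once per row. The multiplier setup below is made up for illustration:

```python
def map_partition(iterator):
    """A mapPartitions-style function: iterator in, iterator out."""
    setup = {"multiplier": 10}          # one-time setup for this partition
    return (row * setup["multiplier"] for row in iterator)

partition = iter([1, 2, 3])             # stands in for one partition's rows
print(list(map_partition(partition)))   # [10, 20, 30]
```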