Whatever you write in bolts would be the logic you want to apply on your
events. In Spark, that logic would be coded in map() or similar such
transformations and/or actions. Spark doesn't enforce a structure for
capturing your processing logic like Storm does.
Regards
Sab
Probably overloading the
Nick is right. I too have implemented it this way and it works just fine. In
my case, there can be even more products. You simply broadcast blocks of
products to userFeatures.mapPartitions() and BLAS multiply in there to get
recommendations. In my case 10K products form one block. Note that you
would
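The blocking approach above can be sketched in plain Python (no Spark or BLAS; the user IDs, product IDs, and feature vectors below are made up for illustration):

```python
import heapq

def score_block(user_features, product_block, top_n=2):
    """Score every product in one broadcast block for each user by dot
    product, keeping only the top-N (product_id, score) pairs per user."""
    recs = {}
    for user_id, u in user_features:
        scored = [(sum(a * b for a, b in zip(u, p)), pid)
                  for pid, p in product_block]
        recs[user_id] = [(pid, s) for s, pid in heapq.nlargest(top_n, scored)]
    return recs

users = [("u1", [1.0, 0.0]), ("u2", [0.0, 1.0])]      # user feature vectors
block = [("p1", [2.0, 0.0]), ("p2", [0.0, 3.0]), ("p3", [1.0, 1.0])]
print(score_block(users, block, top_n=1))
# {'u1': [('p1', 2.0)], 'u2': [('p2', 3.0)]}
```

In the actual job, `product_block` would be a broadcast variable and the scoring loop would run inside `userFeatures.mapPartitions()`, with BLAS doing the multiply over a whole block at a time.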
64GB in parquet could be many billions of rows because of the columnar
compression. And count distinct by itself is an expensive operation. This
is not just on Spark, even on Presto/Impala, you would see performance dip
with count distincts. And the cluster is not that powerful either.
The one iss
Did you try increasing the perm gen for the driver?
Regards
Sab
On 24-Jun-2015 4:40 pm, "Anders Arpteg" wrote:
> When reading large (and many) datasets with the Spark 1.4.0 DataFrames
> parquet reader (the org.apache.spark.sql.parquet format), the following
> exceptions are thrown:
>
> Exception
Are you running on top of YARN? Plus pls provide your infrastructure
details.
Regards
Sab
On 28-Jun-2015 8:47 am, "Ayman Farahat"
wrote:
> Hello;
> I tried to adjust the number of blocks by repartitioning the input.
> Here is how I do it (I am partitioning by users):
>
> tot = newrdd.map(lambd
Are you running on top of YARN? Plus pls provide your infrastructure
details.
Regards
Sab
On 28-Jun-2015 9:20 am, "Sabarish Sasidharan" <
sabarish.sasidha...@manthan.com> wrote:
> Are you running on top of YARN? Plus pls provide your infrastructure
> details.
>
> Rega
Sent from my iPhone
>
> On Jun 27, 2015, at 8:50 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
> Are you running on top of YARN? Plus pls provide your infrastructure
> details.
>
> Regards
> Sab
> On 28-Jun-2015 8:47 am, "Ayman F
You can use S3's listKeys API and do a diff between consecutive listKeys to
identify what's new.
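A minimal sketch of that diff, using plain Python lists in place of the listings an S3 client would return (the key names are made up):

```python
def new_keys(previous, current):
    """Return keys present in the current listing but not the previous one."""
    return sorted(set(current) - set(previous))

prev = ["logs/2016-03-08.gz", "logs/2016-03-09.gz"]
curr = ["logs/2016-03-08.gz", "logs/2016-03-09.gz", "logs/2016-03-10.gz"]
print(new_keys(prev, curr))  # ['logs/2016-03-10.gz']
```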
Are there multiple files in each zip? Single-file archives are processed
just like text, as long as they use one of the supported compression formats.
Regards
Sab
On Wed, Mar 9, 2016 at 10:33 AM, Benjami
I believe you need to co-locate your Zeppelin on the same node where Spark
is installed. You need to specify SPARK_HOME. The master I used was
YARN.
Zeppelin exposes a notebook interface. A notebook can have many paragraphs.
You run the paragraphs. You can mix multiple contexts in the same not
Which version of Spark are you using? The configuration varies by version.
Regards
Sab
On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph
wrote:
> Hi All,
>
> A Hive Join query which runs fine and faster in MapReduce takes lot of
> time with Spark and finally fails with OOM.
>
> *Query: hivejoin.
using MR
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpr
caching anything. You could simply swap the
fractions in your case.
Regards
Sab
On Mon, Mar 14, 2016 at 2:20 PM, Prabhu Joseph
wrote:
> It is a Spark-SQL and the version used is Spark-1.2.1.
>
> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com
performance.
Regards
Sab
On 14-Mar-2016 2:33 pm, "Prabhu Joseph" wrote:
> The issue is the query hits OOM on a Stage when reading Shuffle Output
> from previous stage. How come increasing shuffle memory helps to avoid OOM.
>
> On Mon, Mar 14, 2016 at 2:28 PM, Sabarish S
It will compress only rdds with serialization enabled in the persistence
mode. So you could skip _SER modes for your other rdds. Not perfect but
something.
On 15-Mar-2016 4:33 pm, "Nirav Patel" wrote:
> Hi,
>
> I see that there's following spark config to compress an RDD. My guess is
> it will c
You have a slash before the bucket name. It should be an @ instead.
Regards
Sab
On 15-Mar-2016 4:03 pm, "Yasemin Kaya" wrote:
> Hi,
>
> I am using Spark 1.6.0 standalone and I want to read a txt file from S3
> bucket named yasemindeneme and my file name is deneme.txt. But I am getting
> this error. Here is
; Amazon strongly suggests that we do not use. Please use roles and you will
> not have to worry about security.
>
> Regards,
> Gourav Sengupta
>
> On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <
> sabarish@gmail.com> wrote:
>
>> You have a slash before
Can't you just reduce the amount of data you insert by applying a filter so
that only a small set of idpartitions is selected? You could have multiple
such inserts to cover all idpartitions. Does that help?
Regards
Sab
On 22 May 2016 1:11 pm, "swetha kasireddy"
wrote:
> I am looking at ORC. I in
If all you want to do is to load data into Hive, you don't need to use
Spark.
For subsequent query performance you would want to convert to ORC or
Parquet when loading into Hive.
Regards
Sab
On 16-Dec-2015 7:34 am, "Divya Gehlot" wrote:
> Hi,
> I am a newbie to Spark and I am exploring option a
collect() will bring everything to the driver and is costly. Instead of using
collect() + parallelize, you could use rdd1.checkpoint() along with a more
efficient action like rdd1.count(). This you can do within the for loop.
Hopefully you are using the Kryo serializer already.
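The cost difference can be illustrated outside Spark: an action like count() touches every row but holds only a counter, while collect() would materialize everything in the driver. A rough plain-Python analogy:

```python
def count(it):
    """A cheap action: consumes the iterator but keeps only a counter,
    unlike collect(), which would materialize every row in the driver."""
    n = 0
    for _ in it:
        n += 1
    return n

# A generator stands in for an RDD's rows; count() never builds a list.
rows = (i * i for i in range(1000))
print(count(rows))  # 1000
```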
Regards
Sab
On Mon, D
You can just repartition by the id, if the final objective is to have all
data for the same key in the same partition.
Regards
Sab
On Wed, Jan 6, 2016 at 11:02 AM, Gavin Yue wrote:
> I found that in 1.6 dataframe could do repartition.
>
> Should I still need to do orderby first or I just have t
If you are on EMR, these can go into your hdfs site config. And will work
with Spark on YARN by default.
Regards
Sab
On 11-Jan-2016 5:16 pm, "Krishna Rao" wrote:
> Hi all,
>
> Is there a method for reading from s3 without having to hard-code keys?
> The only 2 ways I've found both require this:
One option could be to store them as blobs in a cache like Redis and then
read + broadcast them from the driver. Or you could store them in HDFS and
read + broadcast from the driver.
Regards
Sab
On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg wrote:
> We have a bunch of Spark jobs deployed a
How much RAM are you giving to the driver? 17000 items being collected
shouldn't fail unless your driver memory is too low.
Regards
Sab
On 13-Jan-2016 6:14 am, "Ritu Raj Tiwari"
wrote:
> Folks:
> We are running into a problem where FPGrowth seems to choke on data sets
> that we think are not too
You could generate as many duplicates as needed, each with a tag/sequence. And
then use a custom partitioner that uses that tag/sequence in addition to the
key to do the partitioning.
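A plain-Python sketch of the salting: replicate each record with a sequence tag, then partition on the (key, tag) pair so one hot key can spread over several partitions. The record contents and copy count are made up:

```python
import itertools

def salt(records, copies):
    """Duplicate each (key, value) record with a sequence tag."""
    return [((key, tag), value)
            for (key, value), tag in itertools.product(records, range(copies))]

def partition_of(salted_key, num_partitions):
    """What a custom Partitioner's getPartition would compute."""
    key, tag = salted_key
    return hash((key, tag)) % num_partitions

data = [("hot", 1)]
salted = salt(data, copies=3)
print(sorted(k for k, _ in salted))
# [('hot', 0), ('hot', 1), ('hot', 2)]
```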
Regards
Sab
On 12-Jan-2016 12:21 am, "Daniel Imberman"
wrote:
> Hi all,
>
> I'm looking for a way to efficiently partition an RD
The most efficient way to determine the number of columns would be to do a
take(1) and split the line in the driver.
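A sketch of the idea: sample one line (what take(1) would return) and split it in the driver. Plain Python, with an assumed comma separator:

```python
def num_columns(first_line, sep=","):
    """Infer the column count from a single sampled line."""
    return len(first_line.split(sep))

sample = "id,name,age,city"   # the one line returned by take(1)
print(num_columns(sample))  # 4
```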
Regards
Sab
On 19-Jan-2016 8:48 pm, "Richard Siebeling" wrote:
> Hi,
>
> what is the most efficient way to split columns and know how many columns
> are created.
>
> Here is the current RDD
>
You could always construct a new MatrixFactorizationModel with your
filtered set of user features and product features. I believe it's just a
stateless wrapper around the actual rdds.
Regards
Sab
On Wed, Feb 3, 2016 at 5:28 AM, Roberto Pagliari
wrote:
> When using ALS, is it possible to use reco
The Hive context can be used instead of sql context even when you are
accessing data from non-Hive sources like MySQL or Postgres, for example. It
has better SQL support than the SQLContext as it uses the HiveQL parser.
Regards
Sab
On 15-Feb-2016 3:07 am, "Mich Talebzadeh" wrote:
> Thanks.
>
>
>
> I
Yes you can look at using the capacity scheduler or the fair scheduler with
YARN. Both allow using full cluster when idle. And both allow considering
cpu plus memory when allocating resources which is sort of necessary with
Spark.
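As a sketch, a YARN fair-scheduler allocations file can opt into DRF (Dominant Resource Fairness), which weighs both CPU and memory when allocating; the queue name and weight below are made up:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- drf considers both vcores and memory, which suits Spark workloads -->
  <queue name="spark">
    <weight>2.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>
```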
Regards
Sab
On 13-Feb-2016 10:11 pm, "Eugene Morozov"
wrote:
> Hi
Make sure you are using s3 bucket in same region. Also I would access my
bucket this way s3n://bucketname/foldername.
You can test privileges using the s3 cmd line client.
Also, if you are using instance profiles you don't need to specify access
and secret keys. No harm in specifying though.
Reg
Looks like your executors are running out of memory. YARN is not kicking
them out. Just increase the executor memory. Also consider increasing
the parallelism, i.e. the number of partitions.
Regards
Sab
On 11-Feb-2016 5:46 am, "Nirav Patel" wrote:
> In Yarn we have following settings enabled so
I believe you will gain more understanding if you look at or use
mapPartitions()
Regards
Sab
On 15-Feb-2016 8:38 am, "Christopher Brady"
wrote:
> I tried it without the cache, but it didn't change anything. The reason
> for the cache is that other actions will be performed on this RDD, even
> th
When running in YARN, you can use the YARN Resource Manager UI to get to
the ApplicationMaster url, irrespective of client or cluster mode.
Regards
Sab
On 15-Feb-2016 10:10 am, "Divya Gehlot" wrote:
> Hi,
> I have Hortonworks 2.3.4 cluster on EC2 and Have spark jobs as scala files
> .
> I am bit
You can setup SSH tunneling.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
Regards
Sab
On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot
wrote:
> Hi,
> I have hadoop cluster set up in EC2.
> I am unable to view application logs in Web UI as its taking intern
I don't think that's a good idea, even if it weren't in Spark. I am trying
to understand the benefits you gain by separating.
I would rather use a Composite pattern approach wherein adding to the
composite cascades the additive operations to the children. Thereby your
foreach code doesn't have to k
EMR does cost more than vanilla EC2. Using spark-ec2 can result in savings
with large clusters, though that is not everybody's cup of tea.
Regards
Sab
On 19-Feb-2016 7:55 pm, "Daniel Siegmann"
wrote:
> With EMR supporting Spark, I don't see much reason to use the spark-ec2
> script unless it is
Your logs are getting archived in your logs bucket in S3.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
Regards
Sab
On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR
wrote:
> Hi
>
> In am using an EMR cluster for running my spark jobs, but after the jo
When using SQL, your full query, including the joins, was executed in
Hive (or the RDBMS) and only the results were brought into the Spark cluster. In
the FP case, the data for the 3 tables is first pulled into the Spark
cluster and then the join is executed.
Thus the time difference.
It's not immedia
And yes, storage grows on demand. No issues with that.
Regards
Sab
On 24-Feb-2016 6:57 am, "Andy Davidson"
wrote:
> Currently our stream apps write results to hdfs. We are running into
> problems with HDFS becoming corrupted and running out of space. It seems
> like a better solution might be to
Writing to S3 is over the network. So will obviously be slower than local
disk. That said, within AWS the network is pretty fast. Still you might
want to write to S3 only after a certain threshold in data is reached, so
that it's efficient. You might also want to use the DirectOutputCommitter
as it
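The thresholding idea can be sketched independent of Spark or S3: buffer records and flush only once enough bytes accumulate, so each write (an S3 PUT, say) carries a worthwhile chunk. The 10-byte threshold below is artificially small for illustration:

```python
class ThresholdWriter:
    """Buffer records and flush in large chunks."""
    def __init__(self, flush_fn, threshold_bytes):
        self.flush_fn = flush_fn       # e.g. an S3 PUT in the real job
        self.threshold = threshold_bytes
        self.buf = []
        self.size = 0

    def write(self, record):
        self.buf.append(record)
        self.size += len(record)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        if self.buf:
            self.flush_fn("".join(self.buf))
            self.buf, self.size = [], 0

chunks = []
w = ThresholdWriter(chunks.append, threshold_bytes=10)
for rec in ["aaaa", "bbbb", "cccc"]:
    w.write(rec)
w.flush()
print(chunks)  # ['aaaabbbbcccc'] -- one big write instead of three small ones
```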
Yes
Regards
Sab
On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote:
> are you saying that HiveContext.sql(...) runs on hive, and not on spark
> sql?
>
> On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> When
iveContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> sys.exit
>
> *Results*
>
> Started at [24/02/2016 08:52:27.27]
> res1: org.apache.spark.sql.DataFrame = [result: string]
> s: org.apache.spa
"SALES")).take(5).foreach(println)
> println ("\nFinished at"); HiveContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> sys.exit
>
> *Results*
>
> Started at [24/02/2016 08:52:27.27]
hive instance to run.
> On Feb 24, 2016 19:15, "Sabarish Sasidharan" <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Yes
>>
>> Regards
>> Sab
>> On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote:
>>
>>> are you saying that H
There is no execution plan for FP. An execution plan exists only for SQL.
Regards
Sab
On 24-Feb-2016 2:46 pm, "Ashok Kumar" wrote:
> Gurus,
>
> Is there anything like explain in Spark to see the execution plan in
> functional programming?
>
> warm regards
>
I believe the ALS algo expects the ratings to be aggregated (A). I don't
see why you have to use decimals for rating.
Regards
Sab
On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada wrote:
> Hello.
>
> I just started working on CF in MLlib.
> I am using trainImplicit because I only have implicit r
Like Robin said, pls explore Pregel. You could do it without Pregel but it
might be laborious. I have a simple outline below. You will need more
iterations if the number of levels is higher.
a-b
b-c
c-d
b-e
e-f
f-c
flatmaptopair
a -> (a-b)
b -> (a-b)
b -> (b-c)
c -> (b-c)
c -> (c-d)
d -> (c-d)
b
I don't have a proper answer to this. But to circumvent if you have 2
independent Spark jobs, you could update one when the other is serving
reads. But it's still not scalable for incessant updates.
Regards
Sab
On 25-Feb-2016 7:19 pm, "Udbhav Agarwal" wrote:
> Hi,
>
> I am using graphx. I am add
I have tested up to 3 billion. ALS scales, you just need to scale your
cluster accordingly. More than building the model, it's getting the final
recommendations that won't scale as nicely, especially when number of
products is huge. This is the case when you are generating recommendations
in a batch
This is because Hadoop writables are being reused. Just map it to some
custom type and then do further operations including cache() on it.
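The pitfall can be reproduced in plain Python: a reader that mutates and re-yields a single object (as Hadoop input formats do with Writables) makes naive caching see only the last row. The `Record` class here is a stand-in for a reused Writable:

```python
class Record:
    """Stands in for a reused Hadoop Writable: the reader mutates one
    instance in place for every row."""
    def __init__(self):
        self.value = None

def read_rows(values):
    rec = Record()            # a single instance, reused for every row
    for v in values:
        rec.value = v
        yield rec

# Naive caching keeps N references to the SAME object:
cached = list(read_rows([1, 2, 3]))
print([r.value for r in cached])  # [3, 3, 3] -- every entry sees the last row

# Fix: map each reused record to a fresh value before caching.
cached = [r.value for r in read_rows([1, 2, 3])]
print(cached)  # [1, 2, 3]
```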
Regards
Sab
On 27-Feb-2016 9:11 am, "Yan Yang" wrote:
> Hi
>
> I am pretty new to Spark, and after experimentation on our pipelines. I
> ran into this weird
Zeppelin?
Regards
Sab
On 01-Mar-2016 12:27 am, "Mich Talebzadeh"
wrote:
> Hi,
>
> Is there such thing as Spark for client much like RDBMS client that have
> cut down version of their big brother useful for client connectivity but
> cannot be used as server.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
BeanInfo?
On 01-Mar-2016 6:25 am, "Steve Lewis" wrote:
> I have a relatively complex Java object that I would like to use in a
> dataset
>
> if I say
>
> Encoder evidence = Encoders.kryo(MyType.class);
>
> JavaRDD rddMyType= generateRDD(); // some code
>
> Dataset datasetMyType= sqlCtx.createDa
If all you want is Spark standalone then it's as simple as installing the
binaries and calling spark-submit with your main class. I would advise
against running Hadoop on Windows, it's a bit of trouble. But yes you
can do it if you want to.
Regards
Sab
On 29-Feb-2016 6:58 pm, "ga
Have you tried spark.*hadoop*.*validateOutputSpecs*?
On 01-Mar-2016 9:43 pm, "Peter Halliday" wrote:
> http://pastebin.com/vbbFzyzb
>
> The problem seems to be to be two fold. First, the ParquetFileWriter in
> Hadoop allows for an overwrite flag that Spark doesn’t allow to be set.
> The second i
You can point to your custom HADOOP_CONF_DIR in your spark-env.sh
Regards
Sab
On 01-Oct-2015 5:22 pm, "Vinoth Sankar" wrote:
> Hi,
>
> I'm new to Spark. For my application I need to overwrite Hadoop
> configurations (Can't change Configurations in Hadoop as it might affect my
> regular HDFS), so
How many rows are you joining? How many rows in the output?
Regards
Sab
On 24-Oct-2015 2:32 am, "pratik khadloya" wrote:
> Actually the groupBy is not taking a lot of time.
> The join that i do later takes the most (95 %) amount of time.
> Also, the grouping i am doing is based on the DataFrame
Did you try sc.binaryFiles(), which gives you an RDD of
PortableDataStream that wraps around the underlying bytes?
On Tue, Oct 27, 2015 at 10:23 PM, Balachandar R.A. wrote:
> Hello,
>
>
> I have developed a hadoop based solution that process a binary file. This
> uses classic hadoop MR techni
If you are writing to S3, also make sure that you are using the direct
output committer. I don't have streaming jobs but it helps in my machine
learning jobs. Also, though more partitions help in processing faster, they
do slow down writes to S3. So you might want to coalesce before writing to
S3.
Hi
You cannot use PairRDD but you can use JavaRDD. So in your case, to
make it work with least change, you would call run(transactions.values()).
Each MLLib implementation has its own data structure typically and you
would have to convert from your data structure before you invoke. For ex if
you
Theoretically the executor is a long lived container. So you could use some
simple caching library or a simple Singleton to cache the data in your
executors, once they load it from mysql. But note that with lots of
executors you might choke your mysql.
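A process-level singleton is enough for that executor-side cache; sketched in plain Python with a hypothetical `load_from_mysql` loader (the data and names are made up):

```python
_cache = None

def get_reference_data(load_fn):
    """Load once per process (executor) and reuse; repeated calls from
    tasks in the same process hit the cached copy, not the database."""
    global _cache
    if _cache is None:
        _cache = load_fn()
    return _cache

loads = []
def load_from_mysql():            # hypothetical loader, counts its calls
    loads.append(1)
    return {"country_codes": ["IN", "US"]}

get_reference_data(load_from_mysql)
get_reference_data(load_from_mysql)
print(len(loads))  # 1 -- only one round trip despite two calls
```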
Regards
Sab
On 05-Nov-2015 7:03 pm, "Kay-Uwe
Qubole is one option where you can use spots and get a couple other
benefits. We use Qubole at Manthan for our Spark workloads.
For ensuring all the nodes are ready, you could use
spark.scheduler.minRegisteredResourcesRatio config property to ensure the execution
doesn't start till the requisite containers h
Qubole uses yarn.
Regards
Sab
On 06-Nov-2015 8:31 am, "Jerry Lam" wrote:
> Does Qubole use Yarn or Mesos for resource management?
>
> Sent from my iPhone
>
> > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
> >
> > Qubole
>
I have seen some failures in our workloads with Kryo, one I remember is a
scenario with very large arrays. We could not get Kryo to work despite
using the different configuration properties. Switching to java serde was
what worked.
Regards
Sab
On Tue, Nov 10, 2015 at 11:43 AM, Hitoshi Ozawa
wrot
That may not be an issue if the app using the models runs by itself (not
bundled into an existing app), which may actually be the right way to
design it considering separation of concerns.
Regards
Sab
On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai wrote:
> This will bring the whole dependencies of sp
We have done this by blocking but without using BlockMatrix. We used our
own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What
is the size of your block? How much memory are you giving to the executors?
I assume you are running on YARN, if so you would want to make sure your
ya
The reserved cores are to prevent starvation, so that user B can run jobs
when user A's job is already running and using almost all of the cluster.
You can change your scheduler configuration to use more cores.
Regards
Sab
On 13-Nov-2015 6:56 pm, "Parin Choganwala" wrote:
> EMR 4.1.0 + Spark 1.5.
scale up with larger matrices and more nodes. I’d be
interested to hear how large a matrix multiplication you managed to do?
Is there an alternative you’d recommend for calculating similarity over a
large dataset?
Thanks,
Eilidh
On 13 Nov 2015, at 09:55, Sabarish Sasidharan <
sabarish
If it is metadata, why would we not cache it before we perform the join?
Regards
Sab
On 13-Nov-2015 10:27 pm, "Eran Medan" wrote:
> Hi
> I'm looking for some benchmarks on joining data frames where most of the
> data is in HDFS (e.g. in parquet) and some "reference" or "metadata" is
> still in RD
It would be easier if you could show some code.
Regards
Sab
On 14-Nov-2015 6:26 am, "Walrus theCat" wrote:
> Hi,
>
> I have an RDD which crashes the driver when being collected. I want to
> send the data on its partitions out to S3 without bringing it back to the
> driver. I try calling rdd.for
How are you writing it out? Can you post some code?
Regards
Sab
On 14-Nov-2015 5:21 am, "Rok Roskar" wrote:
> I'm not sure what you mean? I didn't do anything specifically to partition
> the columns
> On Nov 14, 2015 00:38, "Davies Liu" wrote:
>
>> Do you have partitioned columns?
>>
>> On Thu,
You are probably trying to access the spring context from the executors
after initializing it at the driver. And running into serialization issues.
You could instead use mapPartitions() and initialize the spring context
from within that.
That said I don't think that will solve all of your issues
> public void call(String t) throws Exception {
> springBean.someAPI(t); // here we will have db transaction as
> well.
> }
> });}
>
> Thanks,
> Netai
>
> On Sat, Nov 14, 2015 at 9:49 PM, Sabarish Sasidharan <
> sabarish.sasidha.
You can try increasing the number of partitions before writing it out.
Regards
Sab
On Mon, Nov 16, 2015 at 3:46 PM, Zhang, Jingyu
wrote:
> I am using spark-csv to save files in s3, it shown Size exceeds. Please let
> me know how to fix it. Thanks.
>
> df.write()
> .format("com.databricks.s
Spark will use the number of executors you specify in spark-submit. Are you
saying that Spark is not able to use more executors after you modify it in
spark-submit? Are you using dynamic allocation?
Regards
Sab
On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan <
dineshranganat...@gmail.com> wrot
You can write your own custom partitioner to achieve this
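The getPartition logic of such a partitioner can be sketched in plain Python, assuming integer keys 0..29 and 30 partitions (the key scheme is made up for illustration):

```python
def identity_partitioner(key, num_partitions):
    """Route each integer key to its own partition (what a custom Spark
    Partitioner's getPartition would return)."""
    return key % num_partitions

records = [(k, "value-%d" % k) for k in range(30)]
partitions = [[] for _ in range(30)]
for key, value in records:
    partitions[identity_partitioner(key, 30)].append((key, value))

print(max(len(p) for p in partitions))  # 1 -- exactly one record per partition
```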
Regards
Sab
On 17-Nov-2015 1:11 am, "prateek arora" wrote:
> Hi
>
> I have a RDD with 30 record ( Key/value pair ) and running 30 executor . i
> want to reparation this RDD in to 30 partition so every partition get one
> record and assig
You can also do rdd.toJavaRDD(). Pls check the API docs
Regards
Sab
On 18-Nov-2015 3:12 am, "Bryan Cutler" wrote:
> Hi Ivan,
>
> Since Spark 1.4.1 there is a Java-friendly function in LDAModel to get the
> topic distributions called javaTopicDistributions() that returns a
> JavaPairRDD. If you
The stack trace is clear enough:
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
protocol method: #method(reply-code=406,
reply-text=PRECONDITION_FAILED - inequivalent arg 'durable' for queue
'hello1' in vhost '/': received 'true' but current is 'false', class-id=50,
method-
Those are empty partitions. I don't see the number of partitions specified
in code. That then implies the default parallelism config is being used and
is set to a very high number: the sum of empty and non-empty files.
Regards
Sab
On 21-Nov-2015 11:59 pm, "Andy Davidson"
wrote:
> I start working o
esce() and that coalesce() is taking a lot of
> time. Any suggestions how to choose the file number of files.
>
> Kind regards
>
> Andy
>
>
> From: Xiao Li
> Date: Monday, November 23, 2015 at 12:21 PM
> To: Andrew Davidson
> Cc: Sabarish Sasidharan , "use
If YARN has only 50 cores then it can support at most 49 executors plus 1
application master for the driver.
Regards
Sab
On 24-Nov-2015 1:58 pm, "谢廷稳" wrote:
> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>
> I have ran it again, the command to run it is:
> ./bin/spark-submit --class org.apache.spark.
First of all, select * is not a useful SQL to evaluate. Very rarely would a
user require all 362K records for visual analysis.
Second, collect() forces movement of all data from executors to the driver.
Instead write it out to some other table or to HDFS.
Also Spark is more beneficial when you ha
You could use the number of input files to determine the number of output
partitions. This assumes your input file sizes are deterministic.
Else, you could also persist the RDD and then determine its size using the
apis.
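The persisted-size approach boils down to simple arithmetic: pick a partition count so each output file lands near a target size. A sketch, with 128 MB assumed as the target:

```python
def target_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Pick an output partition count so each file lands near a target size."""
    return max(1, -(-total_bytes // target_partition_bytes))  # ceiling division

print(target_partitions(1 * 1024**3))   # 8 -- eight ~128MB files for 1GB
print(target_partitions(10 * 1024**2))  # 1 -- small inputs stay in one file
```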
Regards
Sab
On 26-Nov-2015 11:13 pm, "Nezih Yigitbasi"
wrote:
> Hi Spark
#2: if using hdfs it's on the disks. You can use the HDFS command line to
browse your data. And then use s3distcp or simply distcp to copy data from
hdfs to S3. Or even use hdfs get commands to copy to local disk and then
use S3 cli to copy to s3
#3. Cost of accessing data in S3 from Ec2 nodes, t
It is still in memory for future rdd transformations and actions. What you
get in the driver is a copy of the data.
Regards
Sab
On Tue, Aug 18, 2015 at 12:02 PM, praveen S wrote:
> When I do an rdd.collect().. The data moves back to driver Or is still
> held in memory across the executors?
>
--
MEMORY_ONLY will fail if there is not enough memory but MEMORY_AND_DISK
will spill to disk
Regards
Sab
On Tue, Aug 18, 2015 at 12:45 PM, Harsha HN <99harsha.h@gmail.com>
wrote:
> Hello Sparkers,
>
> I would like to understand difference btw these Storage levels for a RDD
> portion that doesn
For your jsons, can you tell us what is your benchmark when running on a
single machine using just plain Java (without Spark and Spark sql)?
Regards
Sab
On 28-Aug-2015 7:29 am, "Gavin Yue" wrote:
> Hey
>
> I am using the Json4s-Jackson parser coming with spark and parsing roughly
> 80m records w
mapper.registerModule(DefaultScalaModule)
> val s = mapper.readTree(line)
> }
> println(java.lang.System.currentTimeMillis() - begin)
>
>
> One Json record contains two fileds : ID and List[Event].
>
> I am guessing put all the events into List would take t
How many executors are you using when using Spark SQL?
On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> I see that you are not reusing the same mapper instance in the Scala
> snippet.
>
> Regards
> Sab
>
> On Fri, Aug
Hi Pulasthi
You can always use JavaRDD.rdd() to get the scala rdd. So in your case,
new BlockMatrix(rdd.rdd(), 2, 2)
should work.
Regards
Sab
On Tue, Sep 22, 2015 at 10:50 PM, Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:
> Hi Yanbo,
>
> Thanks for the reply. I thought i might
You can't obtain that from the model. But you can always ask the model to
predict the cluster center for your vectors by calling predict().
Regards
Sab
On Wed, Sep 23, 2015 at 7:24 PM, Tapan Sharma
wrote:
> Hi All,
>
> In the KMeans example provided under mllib, it traverse the outcome of
> KMe
rix>>
>>
>> As Yanbo has mentioned, I think a Java friendly constructor is still in
>> demand.
>>
>> 2015-09-23 13:14 GMT+08:00 Pulasthi Supun Wickramasinghe
>> :
>> > Hi Sabarish
>> >
>> > Thanks, that would indeed solve my problem
A little caution is needed, as one executor per node may not always be ideal,
especially when your nodes have lots of RAM. But yes, using a smaller number
of executors has benefits like more efficient broadcasts.
Regards
Sab
On 24-Sep-2015 2:57 pm, "Adrian Tanase" wrote:
> RE: # because I already have a bun
By java class objects if you mean your custom Java objects, yes of course.
That will work.
Regards
Sab
On 24-Sep-2015 3:36 pm, "Ramkumar V" wrote:
> Hi,
>
> I want to know whether grouping by java class objects is possible or not
> in java Spark.
>
> I have Tuple2< JavaObject, JavaObject>. i wan
Pls be aware that Accumulators involve communication back with the driver
and may not be efficient. I think OP wants some way to extract the stats
from the sql plan if it is being stored in some internal data structure
Regards
Sab
On 5 Nov 2016 9:42 p.m., "Deepak Sharma" wrote:
> Hi Rohit
> You
@Sachin
>>is not elastic. You need to anticipate before hand on volume of data you
will have. Very difficult to add and reduce topic partitions later on.
Why do you say so Sachin? Kafka Streams will readjust once we add more
partitions to the Kafka topic. And when we add more machines, rebalancing
@Sachin
>>The partition key is very important if you need to run multiple instances
of streams application and certain instance processing certain partitions
only.
Again, depending on partition key is optional. It's actually a feature
enabler, so we can use local state stores to improve throughput
I am trying to compute column similarities on a 30x1000 RowMatrix of
DenseVectors. The size of the input RDD is 3.1MB and its all in one
partition. I am running on a single node of 15G and giving the driver 1G
and the executor 9G. This is on a single node hadoop. In the first attempt
the BlockManag
Sorry, I actually meant 30 x 1 matrix (missed a 0)
Regards
Sab
Hi Reza
I see that ((int, int), double) pairs are generated for any combination
that meets the criteria controlled by the threshold. But assuming a simple
1x10K matrix that means I would need at least 12GB memory per executor for
the flat map just for these pairs excluding any other overhead. Is
roups of similar points, consider
> using K-means - it will get you clusters of similar rows with euclidean
> distance.
>
> Best,
> Reza
>
>
> On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>>
>> Hi Rez
Unlike a map() wherein your task is acting on a row at a time, with
mapPartitions(), the task is passed the entire content of the partition in
an iterator. You can then return back another iterator as the output. I
don't do Scala, but from what I understand from your code snippet... The
iterator x
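The contrast can be sketched in plain Python: a mapPartitions-style function receives the whole partition as an iterator and returns another iterator, so per-partition setup (opening a connection, building a parser) runs once, not once per row. The multiplier setup below is made up for illustration:

```python
def map_partition(iterator):
    """A mapPartitions-style function: iterator in, iterator out."""
    setup = {"multiplier": 10}          # one-time setup for this partition
    return (row * setup["multiplier"] for row in iterator)

partition = iter([1, 2, 3])             # stands in for one partition's rows
print(list(map_partition(partition)))   # [10, 20, 30]
```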