>> I'm guessing it might be related to memory constraints for the container;
>> please check the YARN RM and NM logs to find out more details.
>>
>> Thanks
>> Saisai
>>
>> On Fri, Oct 21, 2016 at 8:14 AM, Xi Shen wrote:
>>>
>>> 16/10/20 18:12:14 ER
16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn
application has already exited with state FINISHED!
From this, I think Spark is having difficulty communicating with YARN. You
should check your Spark log.
On Fri, Oct 21, 2016 at 8:06 AM Li Li wrote:
which log file should I
On
If you are running locally, I do not see the point of starting
32 executors with 2 cores each.
Also, you can check the Spark web console to find out where the time is spent.
Also, you may want to read
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
It is a plain Java IO error. Your line is too long. You should alter your
JSON schema so that each line is a small JSON object.
Please do not concatenate all the objects into an array and then write the
array on a single line. You will have difficulty handling such a huge JSON
array in Spark anyway.
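For illustration, a minimal sketch of the one-object-per-line layout; the records and the file name are made up, and reading it back with spark.read.json is only mentioned as the usual follow-up:

    import java.io.PrintWriter

    // hypothetical records; in practice each element is one serialized JSON object
    val records = Seq("""{"id":1,"name":"a"}""", """{"id":2,"name":"b"}""")

    val out = new PrintWriter("records.jsonl")
    records.foreach(r => out.println(r))   // one small JSON object per line, no wrapping array
    out.close()

    // Spark can then read it line by line, e.g. spark.read.json("records.jsonl")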
Beca
Okay, thank you.
On Mon, Oct 17, 2016 at 5:53 PM Sean Owen wrote:
> You can take the "with user-provided Hadoop" binary from the download
> page, and yes that should mean it does not drag in a Hive dependency of its
> own.
>
> On Mon, Oct 17, 2016 at 7:08 AM Xi Shen
I think most of the "big data" tools, like Spark and Hive, are not designed
to edit data; they are only designed to query data. I wonder in what
scenario you need to update a large volume of data repeatedly.
On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot
wrote:
> If my understanding is correct a
Hi,
I want to configure my Hive to use Spark 2 as its engine. According to
Hive's instructions, Spark should be built *without* Hadoop or Hive. I
could build my own, but for some reasons I hope I can use an official
binary build.
So I want to ask if the official Spark binary build labeled "with
I found there are several .conf files in the conf directory. Which one is
used as the default when I click the "new" button on the notebook
homepage? I want to edit the default profile configuration so that all my
notebooks are created with custom settings.
--
Thanks,
David S.
Hi,
First, I am not sure whether I should inherit from InputDStream or
ReceiverInputDStream. For ReceiverInputDStream, why would I want to run a
receiver on each worker node?
If I inherit from InputDStream, what should I do in the compute() method?
--
Thanks,
David S.
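For what it's worth, a bare-bones custom InputDStream might look like the sketch below; the String element type and the fetchBatch() helper are placeholders, not part of the Spark API:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    class MyInputDStream(ssc: StreamingContext) extends InputDStream[String](ssc) {

      override def start(): Unit = {}   // open connections or other resources here
      override def stop(): Unit = {}    // release them here

      // Called on the driver once per batch interval; return the RDD for this batch.
      override def compute(validTime: Time): Option[RDD[String]] = {
        val records: Seq[String] = fetchBatch()           // hypothetical helper
        if (records.isEmpty) None
        else Some(ssc.sparkContext.parallelize(records))
      }

      private def fetchBatch(): Seq[String] = Seq.empty   // placeholder
    }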
Hi Chitturi,
Please check out
https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int).
I think it is caused by the initialization step. The "kmeans||" method does
not initialize the dataset in parallel. If your dataset is large, it takes
information, so I can use it to estimate the progress of the "map"
operation. Looking at the log, it feels like the jobs are done one by one,
sequentially, rather than a batch of #cpu at a time.
I checked the worker nodes, and their CPUs are all busy.
Xi Shen
ing
}
IOUtils.writeLines(lines, System.separator(), file)
}
Note, I was using the IOUtils from commons-io, not the one from the Hadoop package.
The result is that all the files are created in my HDFS, but they have no data at all...
Xi Shen
to produce this problem. Thanks! -Xiangrui
>
> On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen wrote:
>
>> Hi,
>>
>> I have opened a couple of threads asking about the k-means performance
>> problem in Spark. I think I have made a little progress.
>>
>> Previously I use th
On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng wrote:
> Hi Xi,
>
> Please create a JIRA if it takes longer to locate the issue. Did you
> try a smaller k?
>
> Best,
> Xiangrui
>
> On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen wrote:
> > Hi Burak,
nt to reproduce this problem.
I hope a Spark developer could comment on this problem and help identify
whether it is a bug.
Thanks,
Xi Shen
ss it.
Thanks,
David
On Sun, Mar 29, 2015 at 4:34 PM, Burak Yavuz wrote:
> Hi David,
>
> Can you also try with Spark 1.3 if possible
use k*d doubles have to fit in the
> driver.
>
> Reza
>
> On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen wrote:
>
>> I have put more details of my problem at
>> http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
>>
>> It is
that long pause.
Thanks,
David
On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen wrote:
> Yes, I have done repartition.
>
> I tried to repartition to t
hen you load the data to
> equal the number of executors? If your ETL changes the number of
> partitions, you can also repartition before calling KMeans.
>
>
> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen wrote:
>
>> Hi,
>>
>> I have a large data set, and I expect to
I have to use .lines.toArray.toSeq, which is a little tricky.
Xi Shen
On Fri, Mar 27, 2015 at 4:41 PM, Xi Shen wrote:
> Hi,
>
> I want to loa
. I want to
know what's wrong with using the "lines()" function.
Thanks,
Xi Shen
Hi,
I have a large data set, and I expect to get 5000 clusters.
I load the raw data, convert it into DenseVectors, then repartition
and cache; finally I give the RDD[Vector] to KMeans.train().
Now the job is running, and the data are loaded. But according to the Spark UI,
all data are loaded
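For context, the pipeline described above looks roughly like this shell-style sketch (it assumes a spark-shell sc, and the input path, the comma-separated parsing, and the iteration count are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("hdfs:///path/to/data")                     // assumed location
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))    // assumed format
      .repartition(sc.defaultParallelism)
      .cache()

    data.count()                                   // materialize the cache before training
    val model = KMeans.train(data, 5000, 100)      // k = 5000, maxIterations = 100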
AM Xi Shen wrote:
> How do I get the number of cores that I specified at the command line? I
> want to use "spark.default.parallelism". I have 4 executors, each with 8
> cores. According to
> https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
> the
of available cores with
> .repartition(numCores)
> 2) cache data
> 3) call .count() on data right before k-means
> 4) try k=500 (even less if possible)
>
> Thanks,
> Burak
>
> On Mar 26, 2015 4:15 PM, "Xi Shen" wrote:
> >
> > The code is very simple.
> &g
It is brought in by another dependency, so you do not need to specify it
explicitly... I think this is what Ted means.
On Fri, Mar 27, 2015 at 9:48 AM Pala M Muthaia
wrote:
> +spark-dev
>
> Yes, the dependencies are there. I guess my question is how come the build
> is succeeding in the mainline th
Thanks,
David
On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz wrote:
> Can you share the code snippet of how you call k-means? Do you cache the
> data before k-means? Did you repartition the data?
> On Mar 26, 2015 4:02 PM, "Xi Shen" wrote:
>
>> OH, the job I talked about
OH, the job I talked about has run for more than 11 hours without a result... it
doesn't make sense.
On Fri, Mar 27, 2015 at 9:48 AM Xi Shen wrote:
> Hi Burak,
>
> My iterations are set to 500. But I think it should also stop once the
> centroids converge, right?
>
> My spa
> increases the work in executors. If that's not the case, can you give more
> info on what Spark version you are using, your setup, and your dataset?
>
> Thanks,
> Burak
> On Mar 26, 2015 5:10 AM, "Xi Shen" wrote:
>
>> Hi,
>>
>> When I run k-mea
Hi Sandeep,
I followed the DenseKMeans example which comes with the Spark package.
My total vectors are about 40k, and my k=500. All my code is written in
Scala.
Thanks,
David
On Fri, 27 Mar 2015 05:51 sandeep vura wrote:
> Hi Shen,
>
> I am also working on k means clustering with spark. May
Hi,
When I run k-means clustering with Spark, I got this in the last two lines of
the log:
15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
Then it hangs for a long time. There's no active job. The driver machine is
i
OK, after various tests, I found the native library can be loaded when
running in yarn-cluster mode. But I still cannot find out why it won't load
when running in yarn-client mode...
Thanks,
David
On Thu, Mar 26, 2015 at 4:21 PM Xi Shen wrote:
> Not of course...all machines in HDIns
ah~hell, I am using Spark 1.2.0, and my job was submitted to use 8
cores...the magic number in the bug.
Xi Shen
On Thu, Mar 26, 2015 at 5:48 PM, Akhil Das
wro
the
application log; it is miles long, and this is the only exception I found.
And it is not very useful for helping me pinpoint the problem.
Any idea what would be the cause?
Thanks,
Xi Shen
Of course not... all machines in HDInsight are Windows 64-bit servers. And I
have made sure all my DLLs are for 64-bit machines. I have managed to get
those DLLs loaded on my local machine, which is also Windows 64-bit.
Xi Shen
; Blog: https://www.dbtsai.com
>
>
> On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen wrote:
> > Hi,
> >
> > I am doing ML using Spark mllib. However, I do not have full control to
> the
> > cluster. I am using Microsoft Azure HDInsight
> >
> > I want to deplo
What is your environment? I remember I had a similar error when running
"spark-shell --master yarn-client" in a Windows environment.
On Wed, Mar 25, 2015 at 9:07 PM sachin Singh
wrote:
> Hi ,
> when I am submitting a Spark job in cluster mode I get the error below in the
> hadoop-yarn log,
> someone ha
Hi,
I am doing ML using Spark MLlib. However, I do not have full control of the
cluster; I am using Microsoft Azure HDInsight.
I want to deploy BLAS or whatever dependencies are required to accelerate
the computation. But I don't know how to deploy those DLLs when I submit my
JAR to the cluster.
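Roughly what that could look like with spark-submit, if the DLLs are available on the submitting machine; the paths, class name, and jar are placeholders, and whether HDInsight lets these flags through is a separate question:

    spark-submit \
      --class com.example.MyApp \
      --master yarn-client \
      --files /local/path/libopenblas.dll \
      --driver-library-path /local/path \
      --conf spark.executor.extraLibraryPath=. \
      my-app.jar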
> Ref: https://spark.apache.org/docs/latest/mllib-guide.html
>
> Burak
>
> On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu wrote:
>
>> How about pointing LD_LIBRARY_PATH to native lib folder ?
>>
>> You need Spark 1.2.0 or higher for the above to work. See SPARK-1719
>>
>&
" vm option
Thanks,
David
On Sun, Mar 22, 2015 at 2:10 PM Ted Yu wrote:
> bq. the BLAS native cannot be loaded
>
> Have you tried specifying --driver-library-path option ?
>
> Cheers
>
> On Sat, Mar 21, 2015 at 4:42 PM, Xi Shen wrote:
>
>> Yeah, I think it is
tesian(rdd2).filter{ case (a,b) => a < b }
>
> Reza
>
> On Sat, Mar 21, 2015 at 10:37 PM, Xi Shen wrote:
>
>> Hi,
>>
>> I have two big RDD, and I need to do some math against each pair of them.
>> Traditionally, it is like a nested for-loop. But for
t/hadoop/lib/native ...
>
> Cheers
>
> On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen wrote:
>
>> Hi,
>>
>> I use the *OpenBLAS* DLL, and have configured my application to work in
>> IDE. When I start my Spark application from IntelliJ IDE, I can see in the
>>
Hi,
I have two big RDDs, and I need to do some math against each pair of elements.
Traditionally, it is like a nested for-loop. But for RDDs, it causes a nested
RDD, which is prohibited.
Currently, I am collecting one of them and then doing a nested for-loop, to
avoid the nested RDD. But I would like to know if t
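For reference, the cartesian-plus-filter approach quoted further up might look like this small sketch (the toy data and the per-pair "math" are made up):

    val rdd1 = sc.parallelize(Seq(1, 2, 3))
    val rdd2 = sc.parallelize(Seq(1, 2, 3))

    val results = rdd1.cartesian(rdd2)
      .filter { case (a, b) => a < b }           // keep each unordered pair once
      .map { case (a, b) => ((a, b), a * b) }    // placeholder computation per pair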
Hi,
I use the *OpenBLAS* DLL, and I have configured my application to work in the
IDE. When I start my Spark application from the IntelliJ IDE, I can see in the
log that the native lib is loaded successfully.
But if I use *spark-submit* to start my application, the native lib still
cannot be loaded. I saw th
Yeah, I think it is harder to troubleshoot the properties issues in an IDE.
But the reason I stick to the IDE is that if I use spark-submit, the BLAS
native library cannot be loaded. Maybe I should open another thread to discuss
that.
Thanks,
David
On Sun, 22 Mar 2015 10:38 Xi Shen wrote:
> In the
In the log, I saw
MemoryStorage: MemoryStore started with capacity 6.7GB
But I still cannot find where to set this storage capacity.
On Sat, 21 Mar 2015 20:30 Xi Shen wrote:
> Hi Sean,
>
> It's getting strange now. If I ran from IDE, my executor memory is always
> set t
Hi Sean,
It's getting strange now. If I run from the IDE, my executor memory is always
set to 6.7G, no matter what value I set in code. I have checked my
environment variables, and there's no value of 6.7, or 12.5.
Any idea?
Thanks,
David
On Tue, 17 Mar 2015 00:35 null wrote:
> Hi Xi
gards
>
> Peng Xu
>
> 2015-03-16 19:46 GMT+08:00 Xi Shen :
>
>> Hi,
>>
>> In YARN mode you can specify the number of executors. I wonder if we can
>> also start multiple executors at local, just to make the test run faster.
>>
>> Thanks,
>> David
>>
>
>
Hi,
When you submit a jar to the Spark cluster, it is very difficult to see the
logging. Is there any way to save the logging to a file? I mean only the
logging I created, not the Spark log information.
Thanks,
David
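One common way, assuming the default log4j setup, is to route only your own package's logger to a file in conf/log4j.properties; the package name and file path below are hypothetical:

    log4j.appender.myfile=org.apache.log4j.FileAppender
    log4j.appender.myfile.File=/tmp/myapp.log
    log4j.appender.myfile.layout=org.apache.log4j.PatternLayout
    log4j.appender.myfile.layout.ConversionPattern=%d %p %c - %m%n
    log4j.logger.com.example.myapp=INFO, myfile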
Hi,
In YARN mode you can specify the number of executors. I wonder if we can
also start multiple executors locally, just to make the test run faster.
Thanks,
David
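As far as I know, local mode runs a single executor with N task threads rather than N separate executors, so the closest equivalent is something like the sketch below (the master string and app name are just an example):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("local-test").setMaster("local[8]")  // 8 task threads
    val sc = new SparkContext(conf)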
David
On Mon, 16 Mar 2015 22:30 Sean Owen wrote:
> I think you'd have to say more about "stopped working". Is the GC
> thrashing? Does the UI respond? Is the CPU busy or not?
>
> On Mon, Mar 16, 2015 at 4:25 AM, Xi Shen wrote:
> > Hi,
> >
> > I am
emory are you having on your machine? I think default value is
> 0.6 of the spark.executor.memory as you can see from here
> <http://spark.apache.org/docs/1.2.1/configuration.html#execution-behavior>
> .
>
> Thanks
> Best Regards
>
> On Mon, Mar 16, 2015 at 2:26 PM, Xi
ks
> Best Regards
>
> On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen wrote:
>
>> I set it in code, not by configuration. I submit my jar file to local. I
>> am working in my developer environment.
>>
>> On Mon, 16 Mar 2015 18:28 Akhil Das wrote:
>>
>>>
will allocate 4 threads. You can try increasing it to a higher
> number; also try setting the level of parallelism to a higher number.
>
> Thanks
> Best Regards
>
> On Mon, Mar 16, 2015 at 9:55 AM, Xi Shen wrote:
>
>> Hi,
>>
>> I am running k-means using Spark in local m
I set it in code, not by configuration. I submit my jar file locally. I am
working in my development environment.
On Mon, 16 Mar 2015 18:28 Akhil Das wrote:
> How are you setting it? and how are you submitting the job?
>
> Thanks
> Best Regards
>
> On Mon, Mar 16, 2015 at
Hi,
I have set spark.executor.memory to 2048m, and on the UI "Environment"
page I can see this value has been set correctly. But on the "Executors"
page, I see there's only 1 executor and its memory is 265.4MB. A very strange
value. Why not 256MB, or just what I set?
What am I missing here?
T
Hi,
I am running k-means using Spark in local mode. My data set is about 30k
records, and I set k = 1000.
The algorithm started and finished 13 jobs according to the UI monitor, then
it stopped working.
The last log line I saw was:
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cl
Hey, I worked it out myself :)
The "Vector" is actually a "SparseVector", so when it is written into a
string, the format is
(size, [coordinate], [value...])
Simple!
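For example (a quick sketch; the numbers are arbitrary), MLlib's sparse vectors print and parse in exactly that form:

    import org.apache.spark.mllib.linalg.Vectors

    val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
    println(v)                                         // (5,[1,3],[2.0,4.0])
    val parsed = Vectors.parse("(5,[1,3],[2.0,4.0])")  // round-trips the same string format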
On Sat, Mar 14, 2015 at 6:05 PM Xi Shen wrote:
> Hi,
>
> I read this document,
> http://s
Hi,
I read this document,
http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried
to build a TF-IDF model of my documents.
I have a list of documents; each word is represented as an Int, and each
document is listed on one line:
doc_name, int1, int2...
doc_name, int3, int4...
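The flow from that guide looks roughly like the sketch below; it assumes a spark-shell sc, and the input path and comma-separated parsing follow the format described above but are otherwise assumptions:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val docs = sc.textFile("docs.txt")                       // "doc_name, int1, int2..."
      .map(_.split(",").drop(1).map(_.trim).toSeq)           // keep only the word ids

    val tf = new HashingTF().transform(docs)                 // RDD[Vector] of term frequencies
    tf.cache()
    val tfidf = new IDF().fit(tf).transform(tf)              // one TF-IDF vector per document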
Hi,
I have two RDD[Vector]; both Vectors are sparse and of the form:
(id, value)
"id" indicates the position of the value in the vector space. I want to
apply a dot product to two such RDD[Vector] and get a scalar value. The
missing values are treated as zero.
Any convenient tool to do th
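If each "vector" is really an RDD of (id, value) pairs, one way (just a sketch with toy numbers) is to join on the id, multiply, and sum; ids missing from either side drop out of the join, which matches treating them as zero:

    val v1 = sc.parallelize(Seq((0L, 1.0), (3L, 2.0), (7L, 4.0)))
    val v2 = sc.parallelize(Seq((3L, 5.0), (7L, 0.5)))

    val dot = v1.join(v2)                        // only ids present in both survive
      .map { case (_, (a, b)) => a * b }
      .sum()                                     // 2.0 * 5.0 + 4.0 * 0.5 = 12.0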
at
org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:993)
... 12 more
Any suggestions?
Thanks,
Xi Shen
Hi,
I read this page,
http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. But I am
wondering, how do I use this TF-IDF RDD? What does the TF-IDF vector look
like?
Can someone provide me some guidance?
Thanks,
Xi Shen
model import/export for some of the ML algorithms on the current
> master (and they'll be shipped with the 1.3 release).
>
> Burak
> On Mar 7, 2015 4:17 AM, "Xi Shen" wrote:
>
>> Wait...it seem SparkContext does not provide a way to save/load objec
Wait... it seems SparkContext does not provide a way to save/load object
files. It can only save/load RDDs. What did I miss here?
Thanks,
David
On Sat, Mar 7, 2015 at 11:05 PM Xi Shen wrote:
> Ah~it is serializable. Thanks!
>
>
> On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy wrote:
Ah~it is serializable. Thanks!
On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy wrote:
> You can serialize your trained model to persist somewhere.
>
> Ekrem Aksoy
>
> On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen wrote:
>
>> Hi,
>>
>> I checked a few ML
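Ekrem's suggestion above, persisting the trained model with plain Java serialization, might look roughly like this sketch (the paths are made up, and it assumes the model class is Serializable, which the MLlib linear models and KMeansModel of that era are):

    import java.io._

    def saveModel(model: AnyRef, path: String): Unit = {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      try out.writeObject(model) finally out.close()
    }

    def loadModel[T](path: String): T = {
      val in = new ObjectInputStream(new FileInputStream(path))
      try in.readObject().asInstanceOf[T] finally in.close()
    }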
Hi,
I checked a few ML algorithms in MLLib.
https://spark.apache.org/docs/0.8.1/api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel
I could not find a way to save the trained model. Does this mean I have to
train my model every time? Is there a more economic way t
o develop and debug your spark
> program, lets say, intellij idea or eclipse.
>
> Thanks,
> Sun.
>
> --
> fightf...@163.com
>
>
> *From:* Xi Shen
> *Date:* 2015-03-06 09:19
> *To:* user@spark.apache.org
> *Subject:* Spark code development pra
Hi,
I am new to Spark. I see that every Spark program has a main() function. I
wonder if I can run the Spark program directly, without using spark-submit.
I think it would be easier for early development and debugging.
Thanks,
David
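As far as I know, you can run it straight from the IDE in local mode if the master is set in code and the Spark dependencies are on the classpath; a minimal sketch (object name, app name, and the toy job are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object DevMain {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("dev-run").setMaster("local[*]")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 10).sum())   // placeholder job
        sc.stop()
      }
    }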
Hi,
My HDFS and YARN services are started, and my spark-shell works in local
mode.
But when I try spark-shell --master yarn-client, a job is created at
the YARN service, but it fails very soon. The diagnostics are:
Application application_1425559747310_0002 failed 2 times due to AM
Contai
37 PM Sean Owen wrote:
> I don't think the build is at issue. The error suggests your App Master
> can't be contacted. Is there a network port issue? did the AM fail?
>
> On Tue, Feb 24, 2015 at 9:15 AM, Xi Shen wrote:
>
>> Hi Arush,
>>
>> I got the pre-bui
Tue, Feb 24, 2015 at 12:06 PM, Xi Shen wrote:
>
>> Hi,
>>
>> I followed this guide,
>> http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to
>> start spark-shell with yarn-client
>>
>> ./bin/spark-shell --master yarn-client
>>
-shell as
standalone works.
My system:
- Ubuntu amd64
- Spark 1.2.1
- YARN from Hadoop 2.6 stable
Thanks,
Xi Shen