Why? I tried this solution and it works fine.
On Tuesday, June 9, 2015, codingforfun [via Apache Spark User List]
ml-node+s1001560n23218...@n3.nabble.com wrote:
Hi drarse, thanks for replying. The approach you suggested, using a singleton
object, does not work.
On 2015-06-09 16:24:25, drarse [via
Shuffle data can be deleted through the weak-reference mechanism; you could
check the code of ContextCleaner. You could also trigger a full GC manually
with JVisualVM or some other tool to see whether the shuffle files are deleted.
Thanks
Jerry
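For illustration, a minimal sketch of the mechanism Jerry describes, assuming
code running on the driver with an active SparkContext sc (the RDD and sizes
are made up):

  var grouped = sc.parallelize(1 to 1000000).map(i => (i % 100, i)).reduceByKey(_ + _)
  grouped.count()  // materializes shuffle files on the local disks
  grouped = null   // drop the only strong reference to the shuffled RDD
  System.gc()      // a full GC enqueues ContextCleaner's weak reference, which then removes the shuffle files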
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent:
Thanks Akhil and Mark for your valuable comments.
Problem resolved.
AT
On Tue, Jun 9, 2015 at 2:17 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I think yes, as the documentation says: "Creates tuples of the elements
in this RDD by applying f."
Thanks
Best Regards
On Tue, Jun 9, 2015 at
Hi 罗辉,
I think you are reading the logs wrong.
Your program actually runs from this point (the rest is just startup
work and connecting):
15/06/08 16:14:22 INFO broadcast.TorrentBroadcast: Started reading
broadcast variable 0
15/06/08 16:14:23 INFO storage.MemoryStore:
Hi Akhil,
Not exactly: the task took 54s to finish, starting at 16:14:02 and ending at
16:14:56.
Within those 54s, it needed 19s to store the value in memory, from
16:14:23 to 16:14:42. I think this is the most time-consuming part of
this task, and it seems unreasonable. You may check
Jerry, I agree with you.
However, in my case, I kept monitoring the blockmanager folder. I
do see the number of files decrease sometimes, but the folder's size
kept increasing.
Below is a screenshot of the folder. You can see that some old files are
somehow not deleted.
It would be even faster to load the data on the driver and sort it there
without using Spark :). Using reduce() is cheating, because it only works
as long as the data fits on one machine. That is not the targeted use case
of a distributed computation system. You can repeat your test with more
data
Possibly in the future, if and when the Spark architecture allows workers to launch
Spark jobs (the functions passed to transformation or action APIs of RDD),
it will be possible to have an RDD of RDDs.
On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar loni...@gmail.com wrote:
A similar question was asked
Hi,
I posted a question regarding Phoenix and Spark Streaming on
StackOverflow [1]. Please find a copy of the question in this email, below the
first stack trace. I also already contacted the Phoenix mailing list and tried
the suggestion of setting spark.driver.userClassPathFirst.
I think yes, as the documentation says: "Creates tuples of the elements in
this RDD by applying f."
Thanks
Best Regards
On Tue, Jun 9, 2015 at 1:54 PM, amit tewari amittewar...@gmail.com wrote:
Actually the question was: will keyBy() accept multiple fields (e.g.
x(0), x(1)) as the key?
On
A similar question was asked before:
http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
Here is one of the reasons why I think RDD[RDD[T]] is not possible:
- RDD is only a handle to the actual data partitions. It has a
reference/pointer to the SparkContext object
Hi,
Are you restarting your Spark streaming context through getOrCreate?
On 9 Jun 2015 09:30, Haopu Wang hw...@qilinsoft.com wrote:
When I ran a Spark streaming application for longer, I noticed the local
directory's size kept increasing.
I set spark.cleaner.ttl to 1800 seconds in order
Replicating my answer to another question asked today:
Here is one of the reasons why I think RDD[RDD[T]] is not possible:
* RDD is only a handle to the actual data partitions. It has a
reference/pointer to the SparkContext object (sc) and a list of
partitions.
* The SparkContext is an
From the stack I think this problem may be due to the deletion of the broadcast
variable. Since you set spark.cleaner.ttl, the old broadcast variable is
deleted after that timeout, and you will hit this exception when you
try to use it again after that time limit.
Basically I think
Only 1 minor GC, 0.07s.
Thanks & best regards!
San.Luo
----- Original Message -----
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: How to decrease the time of storing block in memory
Date: June 9, 2015, 15:02
You can put a Thread.sleep(10) in the code to keep the UI available for
quite some time. (Put it just before starting any of your transformations.)
Or you can enable the spark history server
https://spark.apache.org/docs/latest/monitoring.html too. I believe --jars
Is it that task taking 19s? It won't simply be taking 19s to store 2KB of
data in memory; there could be other operations happening too (the
transformations that you are doing). It would be good if you could paste the
code snippet you are running, to get a better understanding.
Thanks
Best
Thanks Akhil:
The driver fails too fast to get a look at 4040. Is there any other way to see
the download and ship process of the files?
Is the driver supposed to download these jars from HDFS to some location, then ship
them to executors?
I can see from the log that the driver downloaded the
Actually the question was: will keyBy() accept multiple fields (e.g.
x(0), x(1)) as the key?
On Tue, Jun 9, 2015 at 1:07 PM, amit tewari amittewar...@gmail.com wrote:
Thanks Akhil. As you suggested, I have to go the keyBy() route, as I need the
columns intact.
But will keyBy() accept multiple
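For illustration, a minimal sketch of the composite-key idea discussed in this
thread; keyBy() takes an arbitrary function, so a tuple of fields works as the
key (the paths and field positions are made up, and sc is assumed to be an
active SparkContext):

  val input1 = sc.textFile("/test7").map(_.split(",").map(_.trim))
  val input2 = sc.textFile("/test8").map(_.split(",").map(_.trim))
  // key each record by the tuple of its first two fields; the full row stays intact as the value
  val keyed1 = input1.keyBy(x => (x(0), x(1)))
  val keyed2 = input2.keyBy(x => (x(0), x(1)))
  val joined = keyed1.join(keyed2)  // RDD[((String, String), (Array[String], Array[String]))]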
I couldn't find any solution. I can write but I can't read from Cassandra.
2015-06-09 8:52 GMT+03:00 Yasemin Kaya godo...@gmail.com:
Thanks alot Mohammed, Gerard and Yana.
I can write to the table, but I get an exception. It says: Exception in
thread "main" java.io.IOException: Failed to open
On 8 Jun 2015, at 15:55, Richard Marscher
rmarsc...@localytics.commailto:rmarsc...@localytics.com wrote:
Hi,
we've been seeing occasional issues in production with the FileOutputCommitter
reaching a deadlock situation.
We are writing our data to S3 and currently have speculation enabled. What
Is it the large result set returned from the Thrift Server? And can you paste the
SQL and physical plan?
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, June 9, 2015 12:01 PM
To: Sourav Mazumder
Cc: user
Subject: Re: Spark SQL with Thrift Server is very very slow and finally failing
Try this way:
scala> val input1 = sc.textFile("/test7").map(line =>
line.split(",").map(_.trim))
scala> val input2 = sc.textFile("/test8").map(line =>
line.split(",").map(_.trim))
scala> val input11 = input1.map(x => ((x(0) + x(1)), x(2), x(3)))
scala> val input22 = input2.map(x => ((x(0) + x(1)), x(2), x(3)))
scala>
Like this?
myDStream.foreachRDD(rdd => rdd.saveAsTextFile("/sigmoid/", codec))
Thanks
Best Regards
On Mon, Jun 8, 2015 at 8:06 PM, Bob Corsaro rcors...@gmail.com wrote:
It looks like saveAsTextFiles doesn't support the compression parameter of
RDD.saveAsTextFile. Is there a way to add the
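For illustration, a minimal sketch of the foreachRDD workaround suggested
above, assuming myDStream is the DStream from this thread and using Hadoop's
GzipCodec (the output path is made up; in practice you would include the batch
time in the path):

  import org.apache.hadoop.io.compress.GzipCodec

  myDStream.foreachRDD { rdd =>
    // per-batch equivalent of saveAsTextFiles, but with a compression codec
    rdd.saveAsTextFile("/sigmoid/output", classOf[GzipCodec])
  }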
Cheng,
yes, it works, I set the property in SparkConf before initiating
SparkContext.
The property name is spark.hadoop.dfs.replication
Thanks for the help!
-Original Message-
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Monday, June 08, 2015 6:41 PM
To: Haopu Wang; user
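For illustration, a minimal sketch of the setting described above: Hadoop
configuration options can be passed through SparkConf by prefixing them with
spark.hadoop., set before the SparkContext is created (the app name is made up):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("replication-demo")
    .set("spark.hadoop.dfs.replication", "2")  // forwarded to the Hadoop Configuration as dfs.replication
  val sc = new SparkContext(conf)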
Cheng you were right. It works when I remove the field from either one. I
should have checked the types beforehand. What confused me is that Spark
attempted to join it and midway threw the error. It isn't quite there yet.
Thanks for the help.
On Mon, Jun 8, 2015 at 8:29 PM Cheng Lian
Maybe you should check your driver UI and see if there's any GC time
involved, etc.
Thanks
Best Regards
On Mon, Jun 8, 2015 at 5:45 PM, luohui20...@sina.com wrote:
Hi there,
I am trying to decrease my app's running time on the worker node. I
checked the log and found the most
When I ran a Spark streaming application for longer, I noticed the local
directory's size kept increasing.
I set spark.cleaner.ttl to 1800 seconds in order to clean the metadata.
The spark streaming batch duration is 10 seconds and checkpoint duration
is 10 minutes.
The setting took effect but
Thanks Akhil. As you suggested, I have to go the keyBy() route, as I need the
columns intact.
But will keyBy() accept multiple fields (e.g. x(0), x(1))?
Thanks
Amit
On Tue, Jun 9, 2015 at 12:26 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Try this way:
scala> val input1 =
Once you submit the application, you can check the Environment tab in the
driver UI (running on port 4040) to see whether the jars you added got
shipped or not. If they are shipped and you are still getting NoClassDef
exceptions, then it means you have a jar conflict, which you can
resolve
Hi
I am a little confused here. If I am writing to HDFS, shouldn't the HDFS
replication factor automatically kick in? In other words, how is the Spark
writer different from an hdfs -put command (from the perspective of HDFS, of
course)?
Best
Ayan
On Tue, Jun 9, 2015 at 5:17 PM, Haopu Wang
Yeah, this does look confusing. We are trying to improve the error
reporting by catching similar issues at the end of the analysis phase
and giving more descriptive error messages. Part of the work can be found
here:
For a research project, I tried sorting the elements in an RDD. I did this
using two different approaches.
In the first method, I applied a mapPartitions() function on the RDD, so
that it would sort the contents of the RDD, and provide a result RDD that
contains the sorted list as the only record in
Yes, my Cassandra is listening on 9160, I think. Actually I know it from the yaml
file. The file includes:
rpc_address: localhost
# port for Thrift to listen for clients on
rpc_port: 9160
I checked the port with nc -z localhost 9160; echo $?, and it returns 0. I think
it is closed; should I open this port?
Would you please provide a snippet that reproduces this issue? What
version of Spark were you using?
Cheng
On 6/9/15 8:18 PM, bipin wrote:
Hi,
When I try to save my data frame as a parquet file I get the following
error:
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
Hi All,
I was hoping somebody might be able to help out,
I currently have a network built using graphx which looks like the following
(only with a much larger number of vertices and edges)
Vertices
ID, Attribute1, Attribute2
1001 2 0
1002 1 0
1003 2 1
1004 3 2
1006 4 0
1007 5 1
Sorry, my answer: I ran lsof -i:9160 in the terminal; the result is:
lsof -i:9160
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 7597 inosens 101u IPv4 85754 0t0 TCP localhost:9160
(LISTEN)
So is port 9160 available or not?
2015-06-09 17:16 GMT+03:00 Yasemin Kaya
Hi all,
I'm manually building Spark from source against 1.4 branch and submitting
the job against Yarn. I am seeing very strange behavior. The first 2 or 3
times I submit the job, it runs fine, computes Pi, and exits. The next time
I run it, it gets stuck in the ACCEPTED state.
I'm kicking off a
Hi,
I'm trying to join a DStream with, let's say, a 20s interval against an RDD loaded
from an HDFS folder that changes periodically; let's say a new file arrives
in the folder every 10 minutes.
How should this be done, considering that the HDFS files in the folder are
periodically changing/adding new
Is it possible to bound the costs of operations such as flatMap() and collect()
based on the size of the RDDs?
This may or may not be helpful for your classpath issues, but I wanted to
verify that basic functionality worked, so I made a sample app here:
https://github.com/jmahonin/spark-streaming-phoenix
This consumes events off a Kafka topic using spark streaming, and writes
out event counts to Phoenix
Seems that you're using a DB2 Hive metastore? I'm not sure whether Hive
0.12.0 officially supports DB2, but probably not? (Since I didn't find
DB2 scripts under the metastore/scripts/upgrade folder in Hive source tree.)
Cheng
On 6/9/15 8:28 PM, Needham, Guy wrote:
Hi,
I’m using Spark 1.3.1
If your application is stuck in that state, it generally means your cluster
doesn't have enough resources to start it.
In the RM logs you can see how many vcores / memory the application is
asking for, and then you can check your RM configuration to see if that's
currently available on any single
Thank you for your responses!
You mention that it only works as long as the data fits on a single
machine. What I am trying to do is receive the sorted contents of my
dataset. For this to be possible, the entire dataset should be able to fit
on a single machine. Are you saying that sorting the
I am trying to implement top-k in Scala within Apache Spark. I am aware that
Spark has a top() action. But top() uses reduce(). Instead, I would like to
use treeReduce(). I am trying to compare the performance of reduce() and
treeReduce().
The main issue I have is that I cannot use these 2 lines
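For illustration, a minimal self-contained sketch of a treeReduce()-based top-k
that avoids Spark's private helpers (the per-partition sort is deliberately
simple and not tuned; sc is assumed to be an active SparkContext):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  def treeTopK[T: Ordering : ClassTag](rdd: RDD[T], k: Int): Array[T] = {
    val desc = implicitly[Ordering[T]].reverse
    // each partition contributes its k largest elements as a single array
    val partTops = rdd.mapPartitions(it => Iterator.single(it.toArray.sorted(desc).take(k)))
    // merge the arrays pairwise up a tree instead of all at once on the driver
    partTops.treeReduce((a, b) => (a ++ b).sorted(desc).take(k))
  }

  val top5 = treeTopK(sc.parallelize(1 to 1000), 5)  // Array(1000, 999, 998, 997, 996)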
Hm. Yeah, your port is good... Have you seen this thread:
http://stackoverflow.com/questions/27288380/fail-to-use-spark-cassandra-connector
? It seems that you might be running into version mismatch issues.
What versions of Spark/Cassandra-connector are you trying to use?
On Tue, Jun 9, 2015 at
That would constitute a major change in Spark's architecture. It's not
happening anytime soon.
On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar loni...@gmail.com wrote:
Possibly in future, if and when spark architecture allows workers to
launch spark jobs (the functions passed to transformation
Are you saying that sorting the entire data and collecting it on the
driver node is not a typical use case?
It most definitely is not. Spark is designed and intended to be used with
very large datasets. Far from being typical, collecting hundreds of
gigabytes, terabytes or petabytes to the
Correct. Trading away scalability for increased performance is not an
option for the standard Spark API.
On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
It would be even faster to load the data on the driver and sort it there
without using Spark :).
I found that the problem was due to garbage collection in filter(). Using
Hive to do the filter solved the problem.
A lot of other problems went away when I upgraded to Spark 1.2.0, which
compresses various task overhead data (HighlyCompressedMapStatus etc.).
It has been running very very
Hi Marcelo,
Thanks. I think something more subtle is happening.
I'm running a single-node cluster, so there's only 1 NM. When I executed
the exact same job the 4th time, the cluster was idle, and there was
nothing else being executed. RM currently reports that I have 6.5GB of
memory and 4 cpus
My jar files are:
cassandra-driver-core-2.1.5.jar
cassandra-thrift-2.1.3.jar
guava-18.jar
jsr166e-1.1.0.jar
spark-assembly-1.3.0.jar
spark-cassandra-connector_2.10-1.3.0-M1.jar
spark-cassandra-connector-java_2.10-1.3.0-M1.jar
spark-core_2.10-1.3.1.jar
spark-streaming_2.10-1.3.1.jar
And my code
It is strange that writes work but reads do not. If it were a Cassandra
connectivity issue, then neither writes nor reads would work. Perhaps the problem
is somewhere else.
Can you send the complete exception trace?
Also, just to make sure that there is no DNS issue, try this:
Thanks Ayan, I used beeline in Spark to connect to the Hiveserver2 that I
started from my Hive. So as you said, it is really interacting with Hive as
a typical 3rd-party application, and it is NOT using the Spark execution
engine. I was thinking that it gets metastore info from Hive, but uses
Spark to
Hm, the jars look OK, although it's a bit of a mess -- you have spark-assembly
1.3.0 but then core and streaming 1.3.1... It's generally a bad idea to mix
versions. spark-assembly bundles all Spark packages, so either use them
separately or use spark-assembly, but don't mix them like you've shown.
As to the
I am trying to use Spark 1.3 (Standalone) against Hive 1.2 running on
Hadoop 2.6.
I looked at the ThriftServer2 logs and realized that the server was not
starting properly, because of a failure in creating a server socket. In fact,
I had passed the URI of my Hiveserver2 service, launched from Hive,
Hello,
While trying to link kafka to spark, I'm not able to get data from kafka.
This is the error that I'm getting from spark logs:
ERROR EndpointWriter: dropping message [class
akka.actor.ActorSelectionMessage] for non-local recipient
[Actor[akka.tcp://sparkMaster@localhost:7077/]] arriving at
Hi,
I don't have a complete answer to your questions but:
Removing the suffix does not solve the problem - unfortunately this is
true, the master web UI only tries to build out a Spark UI from the event
logs once, at the time the context is closed. If the event logs are
in-progress at this time,
I removed the core and streaming jars, and the exception is still the same.
I tried what you said then results:
~/cassandra/apache-cassandra-2.1.5$ bin/cassandra-cli -h localhost -p 9160
Connected to: Test Cluster on localhost/9160
Welcome to Cassandra CLI version 2.1.5
The CLI is deprecated and will be
Hi Dibyendu,
Thank you for your reply.
I am using Kafka https://github.com/dibbhatt/kafka-spark-consumer, which
uses spark-core and spark-streaming 1.2.2.
The Spark cluster on which I am running the application is 1.3.1. I will test it
with the latest changes.
Yes, the underlying BlockManager gives an error
Yes! If I either specify a different queue or don't specify a queue at all,
it works.
On Tue, Jun 9, 2015 at 4:25 PM, Marcelo Vanzin van...@cloudera.com wrote:
Does it work if you don't specify a queue?
On Tue, Jun 9, 2015 at 1:21 PM, Matt Kapilevich matve...@gmail.com
wrote:
Hi Marcelo,
From the RM scheduler, I see 3 applications currently stuck in the
root.thequeue queue.
Used Resources: memory:0, vCores:0
Num Active Applications: 0
Num Pending Applications: 3
Min Resources: memory:0, vCores:0
Max Resources: memory:6655, vCores:4
Steady Fair Share: memory:1664, vCores:0
Hi All,
I have some code to access s3 from Spark. The code is as simple as:
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Configuration hadoopConf = ctx.hadoopConfiguration();
// aws.secretKey=Zqhjim3GB69hMBvfjh+7NX84p8sMF39BHfXwO3Hs
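For illustration, a minimal sketch of the usual way to supply these credentials
through the Hadoop configuration (the s3n property names below are the classic
Hadoop ones; the keys and bucket are placeholders):

  val hadoopConf = sc.hadoopConfiguration
  hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
  hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
  val lines = sc.textFile("s3n://some-bucket/some/path")  // made-up bucket and path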
Hi User group,
We are using Spark Linear Regression with SGD as the optimization technique, and
we are achieving very sub-optimal results.
Can anyone shed some light on why this implementation seems to produce such
poor results vs. our own implementation?
We are using a very small dataset, but
Apologies, I see you already posted everything from the RM logs that
mention your stuck app.
Have you tried restarting the YARN cluster to see if that changes anything?
Does it go back to the "first few tries work" behaviour?
I run 1.4 on top of CDH 5.4 pretty often and haven't seen anything like
Does it work if you don't specify a queue?
On Tue, Jun 9, 2015 at 1:21 PM, Matt Kapilevich matve...@gmail.com wrote:
Hi Marcelo,
Yes, restarting YARN fixes this behavior and it again works the first few
times. The only thing that's consistent is that once Spark job submissions
stop working,
Hi Frank,
Thanks for the reply. I downloaded ADAM and built it, but it does not seem
to list this function among its command-line options.
Are these exposed as a public API that I can call from code?
Also, I need to save all my intermediate data. It seems ADAM stores
data in Parquet on HDFS.
I want
Hi Marcelo,
Yes, restarting YARN fixes this behavior and it again works the first few
times. The only thing that's consistent is that once Spark job submissions
stop working, it's broken for good.
On Tue, Jun 9, 2015 at 4:12 PM, Marcelo Vanzin van...@cloudera.com wrote:
Apologies, I see you
In my test data, I have a JavaRDD with a single String (the size of this RDD is
1).
On a 3-node YARN cluster, the mapToPair function on this RDD sends the same
input String to 2 different nodes. Container logs on these nodes show the
same string as input.
Overriding the default partition count by
On Tue, Jun 9, 2015 at 11:31 AM, Matt Kapilevich matve...@gmail.com wrote:
Like I mentioned earlier, I'm able to execute Hadoop jobs fine even now -
this problem is specific to Spark.
That doesn't necessarily mean anything. Spark apps have different resource
requirements than Hadoop apps.
1) Could you share your command?
2) Are the kafka brokers on the same host?
3) Could you run a --describe on the topic to see if the topic is setup
correctly (just to be sure)?
You should try, from the SparkConf object, to issue a get.
I don't have the exact name for the matching key, but from reading the code
in SparkSubmit.scala, it should be something like:
conf.get("spark.executor.instances")
By writing PDF files, do you mean something equivalent to a hadoop fs -put
/path?
I'm not sure how PDFBox works, though; have you tried writing the files
individually without Spark?
Once you have established that as a starting point, we can look at
how Spark can be interfaced to write to HDFS.
I see the other jobs SUCCEEDED without issues.
Could you snapshot the FairScheduler activity as well?
My guess is that, with the single core, it is reaching a NodeManager that is
still busy with other jobs, and the job ends up in a waiting state.
Does the job eventually complete?
Could you
I would like to write pdf files using pdfbox to HDFS from my Spark
application. Can this be done?
I've tried running a Hadoop app pointing to the same queue. Same thing now,
the job doesn't get accepted. I've cleared out the queue and killed all the
pending jobs, the queue is still unusable.
It seems like an issue with YARN, but it's specifically Spark that leaves
the queue in this state.
I figured it out, in case anyone else has this problem in the future.
spark-submit --driver-class-path lib/postgresql-9.4-1201.jdbc4.jar
--packages com.databricks:spark-csv_2.10:1.0.3 path/to/my/script.py
What I found is that you MUST put the path to your script at the end of the
spark-submit
Looks like the real culprit is a library version mismatch:
Caused by: java.lang.NoSuchMethodError:
org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(Ljava/lang/String;I)Lorg/apache/thrift/transport/TTransport;
at
Hi Stephen
How many is "a very large number" of iterations? SGD is notorious for requiring
100s or 1000s of iterations; you may also need to spend some time tweaking the
step size. In 1.4 there is an implementation of ElasticNet Linear Regression
which is supposed to compare favourably with an
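For illustration, a minimal sketch of the tuning suggested above, on made-up
data; the numIterations and stepSize values are illustrative starting points,
not recommendations:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  val training = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.dense(1.0)),
    LabeledPoint(2.0, Vectors.dense(2.0)),
    LabeledPoint(3.0, Vectors.dense(3.0))
  ))
  // SGD often needs hundreds or thousands of iterations and a carefully tuned step size
  val model = LinearRegressionWithSGD.train(training, 1000, 0.01)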
Hi,
When I try to save my data frame as a parquet file I get the following
error:
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
org.apache.spark.sql.types.Decimal
at
org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
Hi,
I'm using Spark 1.3.1 to insert into a Hive 0.12 table from a SparkSQL query.
The query is a very simple select from a dummy Hive table used for benchmarking.
I'm using a "create table as" statement to do the insert. Whether I do that
or an insert overwrite, I get the same Hive
Having the following code in RDD.scala works for me. PS: in the following
code, I merge the smaller queue into the larger one. I wonder if this will help
performance. Let me know when you do the benchmark.
def treeTakeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
if (num ==
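A hedged completion of the truncated snippet above, following the
merge-the-smaller-queue-into-the-larger idea it describes (this assumes the
method lives inside RDD.scala, since BoundedPriorityQueue and withScope are
private to Spark):

  def treeTakeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    if (num == 0) {
      Array.empty
    } else {
      val queues = mapPartitions { items =>
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= items
        Iterator.single(queue)
      }
      queues.treeReduce { (q1, q2) =>
        // merge the smaller queue into the larger one, as described above
        if (q1.size >= q2.size) { q1 ++= q2; q1 } else { q2 ++= q1; q2 }
      }.toArray.sorted(ord)
    }
  }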
With Spark Streaming, I am maintaining a state (updateStateByKey every 30s) and,
every 5 minutes, emitting to file the parts of that state that have been closed,
but I only care about the last state collected.
In 5 minutes, there will be 10 updateStateByKey iterations.
For example:
…
val ssc = new
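For illustration, a minimal sketch of the updateStateByKey pattern being
described, assuming pairs is a DStream[(String, Int)] from earlier in the job:

  ssc.checkpoint("/tmp/checkpoints")  // made-up path; checkpointing is required for stateful operations
  // running count per key, updated on every 30s batch
  val state = pairs.updateStateByKey[Int] { (newValues: Seq[Int], current: Option[Int]) =>
    Some(newValues.sum + current.getOrElse(0))
  }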
I don't know anything about your use case, so take this with a grain of
salt, but typically if you are operating at a scale that benefits from
Spark, then you likely will not want to write your output records as
individual files into HDFS. Spark has built-in support for the Hadoop
SequenceFile
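For illustration, a minimal sketch of the SequenceFile suggestion, assuming
each PDF has first been rendered to a byte array (for example via PDFBox's
PDDocument.save into a ByteArrayOutputStream); the file names, bytes, and
output path are made up:

  // (filename, pdf bytes) pairs; Spark's implicits convert String/Array[Byte] to Writables
  val pdfs = sc.parallelize(Seq(
    ("doc1.pdf", Array[Byte](0x25, 0x50, 0x44, 0x46)),  // "%PDF" header bytes as a stand-in
    ("doc2.pdf", Array[Byte](0x25, 0x50, 0x44, 0x46))
  ))
  pdfs.saveAsSequenceFile("hdfs:///output/pdfs")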
I agree with Richard. It looks like the issue here is shuffling, and
shuffle data is always written to disk, so the issue is definitely not that
all the output of flatMap has to be stored in memory.
If at all possible, I'd first suggest upgrading to a newer version of Spark
-- even in 1.2, there
I am using Spark (standalone) to run queries (from a remote client) against
data in tables that are already defined/loaded in Hive.
I have started metastore service in Hive successfully, and by putting
hive-site.xml, with proper metastore.uri, in $SPARK_HOME/conf directory, I
tried to share its
As Robin suggested, you may try the following new implementation.
https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef
Thanks.
Sincerely,
DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
Is your cassandra installation actually listening on 9160?
lsof -i :9160
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 29232 ykadiysk 69u IPv4 42152497 0t0 TCP
localhost:9160 (LISTEN)
I am running an out-of-the-box cassandra conf where
rpc_address: localhost
#
At which point would I call cache()? I just want the runtime to spill to
disk when necessary, without me having to know when that is.
On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger c...@koeninger.org wrote:
direct stream isn't a receiver, it isn't required to cache data anywhere
From the log file I noticed that the ExecutorLostFailure happens after the
memory used by the executor exceeds the executor memory value.
However, even if I increase the executor memory value, the executor still fails;
it just takes longer.
I'm wondering whether, for joining 2 Hive tables,
Hi,
I have configured Spark to run on YARN. Whenever I start the Spark shell using
the 'spark-shell' command, it automatically gets killed. The output looks like
this:
ubuntu@dev-cluster-gateway:~$ ls shekhar/
edx-spark
ubuntu@dev-cluster-gateway:~$ spark-shell
Welcome to
[Spark ASCII-art banner]
Hi community,
I want to append results to one file. When I run locally, my function works all
right; when I run it on a YARN cluster, I lose some rows.
Here is my function to write:
points.foreach(
new VoidFunction<Tuple2<Integer, GeoTimeDataTupel>>() {
private static final long
Yes, true. That's why I said "if and when".
But hopefully I have given a correct explanation of why an RDD of RDDs is not
possible.
On 09-Jun-2015 10:22 pm, Mark Hamstra m...@clearstorydata.com wrote:
That would constitute a major change in Spark's architecture. It's not
happening anytime soon.
On
Thanks so much!
I did put a sleep in my code to keep the UI available.
Now from the UI, I can see:
· In the “Spark Properties” section, spark.jars and spark.files are
set to what I want.
· In the “Classpath Entries” section, my jar and file paths are
there (with an HDFS path).
I am not sure they work with HDFS paths. You may want to look at the
source code. Alternatively you can create a fat jar containing all jars
(let your build tool set up META-INF correctly). This always works.
On Wed, 10 June 2015 at 6:22, Dong Lei dong...@microsoft.com wrote:
Thanks So much!
Hi Jörn:
I started to check the code, and sadly it seems it does not work with HDFS paths.
In HTTPFileServer.scala:
def addFileToDir:
….
Files.copy
….
It looks like it only copies files from local to