Hello,
I submit Spark Streaming jobs inside YARN, and I have configured YARN to generate custom
logs.
It works fine and YARN aggregates the logs very well into HDFS; nevertheless,
the log files are only usable via the "yarn logs" command.
I would prefer to be able to navigate inside them via hdfs commands like a te
4 partitions.
- Original Message -
From: "Dibyendu Bhattacharya"
To: "Nicolas Biau"
Cc: "Cody Koeninger" , "user"
Sent: Sunday, October 4, 2015 16:51:38
Subject: Re: Spark Streaming over YARN
How many partitions are there in your Kafka topic?
Regards,
Dibyendu
On Sun, Oct 4, 2015 at
Hello,
I am using https://github.com/dibbhatt/kafka-spark-consumer
I specify 4 receivers in the ReceiverLauncher, but in the YARN console I can see
only one node receiving the Kafka flow.
(I use Spark 1.3.1)
Tks
Nicolas
- Original Message -
From: "Dibyendu Bhattacharya"
To: nib...@free.fr
Cc: "Cod
Thanks a lot, but why did you say "the most recent version"?
- Original Message -
From: "Jörn Franke"
To: "nibiau"
Cc: banto...@gmail.com, user@spark.apache.org
Sent: Saturday, October 3, 2015 13:56:43
Subject: Re: RE : Re: HDFS small file generation problem
Yes the m
Hello,
Thanks. If I understand correctly, Hive could be usable in my context?
Nicolas
Sent from my Samsung mobile device. Jörn Franke
wrote: If you use transactional tables in Hive together with insert, update,
and delete, then it does the "concatenate" for you automatically at regular intervals
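To make that suggestion concrete, a hedged sketch of a Hive ACID table (Hive 0.14+): transactional tables must be bucketed and stored as ORC, and Hive's background compactor then merges the small delta files automatically. Connection string, table, and column names are placeholders; this drives HiveServer2 through its JDBC driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAcidSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connection string; requires the hive-jdbc driver on the classpath.
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "user", "");
    Statement stmt = con.createStatement();
    // ACID tables (Hive 0.14+) must be bucketed, stored as ORC, and marked transactional.
    stmt.execute("CREATE TABLE products (id STRING, picture STRING) "
        + "CLUSTERED BY (id) INTO 4 BUCKETS "
        + "STORED AS ORC TBLPROPERTIES ('transactional'='true')");
    // Row-level updates are then allowed; the compactor merges delta files periodically.
    stmt.execute("UPDATE products SET picture = 'new-payload' WHERE id = '42'");
    stmt.close();
    con.close();
  }
}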
Hello,
So, is Hive a solution for my need:
- I receive small messages (10KB) identified by an ID (a product ID, for example)
- Each message I receive is the latest picture of my product ID, so basically I just want
to store the latest picture of each product inside HDFS
in order to process batches on it later.
I
Hello,
Finally, Hive is not a solution, as I cannot update the data.
And for archive files I think it would be the same issue.
Any other solutions?
Nicolas
- Original Message -
From: nib...@free.fr
To: "Brett Antonides"
Cc: user@spark.apache.org
Sent: Friday, October 2, 2015 18:37:22
Subject: Re
Ok thanks, but can I also update data instead of inserting data?
- Original Message -
From: "Brett Antonides"
To: user@spark.apache.org
Sent: Friday, October 2, 2015 18:18:18
Subject: Re: HDFS small file generation problem
I had a very similar problem and solved it with Hive and ORC files
Sorry, I just said that I NEED to manage offsets, so in the case of Kafka Direct
Stream, how can I handle this?
Update ZooKeeper manually? Why not, but are there any other solutions?
- Original Message -
From: "Cody Koeninger"
To: "Nicolas Biau"
Cc: "user"
Sent: Friday, October 2, 2015 18:29:09
Subject
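For reference, the pattern the Spark 1.3 Kafka integration guide describes for exactly this (a sketch; the broker address and the offset store are placeholders): pass your saved offsets in through fromOffsets, and read the offsets of each processed batch back out through HasOffsetRanges so you can persist them yourself, in ZooKeeper or anywhere else.

import java.util.HashMap;
import java.util.Map;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "broker1:9092");  // placeholder

// Restored from your own store (ZooKeeper, HDFS, a DB...) after a failure;
// loadOffsetsFromMyStore() is a hypothetical helper.
Map<TopicAndPartition, Long> fromOffsets = loadOffsetsFromMyStore();

JavaInputDStream<String> messages = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    String.class, kafkaParams, fromOffsets,
    new Function<MessageAndMetadata<String, String>, String>() {
      public String call(MessageAndMetadata<String, String> mmd) {
        return mmd.message();
      }
    });

messages.foreachRDD(new Function<JavaRDD<String>, Void>() {
  public Void call(JavaRDD<String> rdd) {
    OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // ... process rdd, then persist ranges[i].untilOffset() to your store ...
    return null;
  }
});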
Ok, so if I set for example 4 receivers (the number of nodes), how will the RDD be
distributed over the nodes/cores?
For example, in my case I have 4 nodes (with 2 cores each).
Tks
Nicolas
- Original Message -
From: "Dibyendu Bhattacharya"
To: nib...@free.fr
Cc: "Cody Koeninger" , "user"
Sent: Friday
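One detail worth adding here (an assumption about the usual answer, not a quote from the reply): each receiver permanently occupies one executor core, so with 4 nodes x 2 cores, 4 receivers leave only 4 cores for processing; repartitioning the unioned stream spreads the remaining work over all of them.

// unionStreams is the JavaDStream returned by ReceiverLauncher (see the earlier sketch).
// 4 nodes x 2 cores = 8 cores; 4 are held by receivers, so repartition the
// received blocks across the executors before the heavy per-record work.
JavaDStream<MessageAndMetadata> distributed = unionStreams.repartition(8);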
Hello,
Yes, but:
- In the Java API I don't find an API to create an HDFS archive
- As soon as I receive a message (with a messageID) I need to replace the old
existing file with the new one (the name of the file being the messageID); is that
possible with an archive?
Tks
Nicolas
- Original Message -
From: "Jö
From my understanding, as soon as I use YARN I don't need to use parallelism
(at least for the RDD treatment).
I don't want to use direct stream, as I have to manage the offset positioning
(in order to be able to start from the last offset treated after a Spark job
failure)
- Original Message
Hello,
I have a job receiving data from Kafka (4 partitions) and persisting the data
inside MongoDB.
It works fine, but when I deploy it inside a YARN cluster (4 nodes with 2 cores),
only one node is receiving all the Kafka partitions and only one node is
processing my RDD treatment (foreach function)
H
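A sketch of the usual receiver-based fix (the topic, group, and ZooKeeper address are placeholders): create one receiver per Kafka partition and union them, so YARN can schedule the receivers on different executors, then repartition so all 8 cores share the foreach work.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// jssc is the job's JavaStreamingContext
List<JavaPairDStream<String, String>> streams =
    new ArrayList<JavaPairDStream<String, String>>();
for (int i = 0; i < 4; i++) {  // one receiver per Kafka partition
  streams.add(KafkaUtils.createStream(jssc, "zkhost:2181", "my-group",
      Collections.singletonMap("my-topic", 1)));
}
JavaPairDStream<String, String> unioned =
    jssc.union(streams.get(0), streams.subList(1, streams.size()));

// 4 nodes x 2 cores: spread the per-record treatment over all available cores.
JavaPairDStream<String, String> distributed = unioned.repartition(8);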
Hello,
I'm still investigating the small file generation problem created by my Spark
Streaming jobs.
Indeed, my Spark Streaming jobs receive a lot of small events (avg 10KB),
and I have to store them inside HDFS in order to treat them with Pig jobs
on-demand.
The problem is the fact that I
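For what it's worth, a common first mitigation (a sketch only; it still leaves one file per micro-batch, which is why the thread keeps looking for a real fix such as archives or Hive): coalesce each batch to a single partition before writing.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Time;

// incoming is the JavaDStream<String> of events; the output path is a placeholder.
incoming.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
  public Void call(JavaRDD<String> rdd, Time time) {
    if (!rdd.isEmpty()) {
      // one file per micro-batch instead of one file per partition
      rdd.coalesce(1).saveAsTextFile("/user/nicolas/events/" + time.milliseconds());
    }
    return null;
  }
});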
Hello,
I use a custom receiver in order to receive JMS messages from MQ servers.
I want to benefit from the YARN cluster; my questions are:
- Is it possible to have only one node receiving JMS messages and to parallelize
the RDD over all the cluster nodes?
- Is it also possible to parallelize the messa
Hello,
Could you please explain to me what exactly is distributed when I launch a Spark
Streaming job over a YARN cluster?
My code is something like:
JavaDStream customReceiverStream =
ssc.receiverStream(streamConfig.getJmsReceiver());
JavaDStream incoming_msg = customReceiverStream.map(
Hello,
I have a Spark application with a JMS receiver.
Basically, my application does:
JavaDStream incoming_msg = customReceiverStream.map(
new Function()
{
public String call(JMSEvent jmsEve
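The snippet above is cut off by the archive; here is a hedged completion of what such a pipeline typically looks like (JMSEvent and its getBody() accessor are assumptions taken from this thread, not a known public API):

import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;

JavaDStream<JMSEvent> customReceiverStream =
    ssc.receiverStream(streamConfig.getJmsReceiver());

JavaDStream<String> incoming_msg = customReceiverStream.map(
    new Function<JMSEvent, String>() {
      public String call(JMSEvent jmsEvent) {
        return jmsEvent.getBody();  // assumed accessor for the message payload
      }
    });

incoming_msg.print();  // stand-in for the real sink (e.g. the MongoDB write)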
A HAR archive seems a good idea, but just one last question to be sure to make the
best choice:
- Is it possible to override (remove/replace) a file inside the HAR?
Basically, the names of my small files will be the keys of my records, and
sometimes I will need to replace the content of a file by a n
My main question in the case of HAR usage is: is it possible to use Pig on it, and
what about performance?
- Original Message -
From: "Jörn Franke"
To: nib...@free.fr, user@spark.apache.org
Sent: Thursday, September 3, 2015 15:54:42
Subject: Re: Small File to HDFS
Store them as hadoop archive (ha
Ok, but then some questions:
- Sometimes I have to remove some messages from HDFS (cancel/replace cases);
is that possible?
- In the case of a big zip file, is it possible to easily run Pig on it
directly?
Tks
Nicolas
- Original Message -
From: "Tao Lu"
To: nib...@free.fr
Cc: "Ted Yu" , "
Hi,
I already store them in MongoDB in parallel for operational access, and I don't
want to add another database to the loop.
Is that the only solution?
Tks
Nicolas
- Original Message -
From: "Ted Yu"
To: nib...@free.fr
Cc: "user"
Sent: Wednesday, September 2, 2015 18:34:17
Subject: Re: Small File
Hello,
I'm currently using Spark Streaming to collect small messages (events), each
<50 KB in size; the volume is high (several million per day), and I have to store
those messages in HDFS.
I understood that storing small files can be problematic in HDFS; how can I
manage it?
Tks
Nicolas
--
Hello,
I am a new user of Spark and need to know what the best practice would be for
the following scenario (a minimal sketch follows below):
- Spark Streaming receives XML messages from Kafka
- Spark transforms each message of the RDD (xml2json + some enrichments)
- Spark stores the transformed/enriched messages inside MongoDB
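As hinted above, a minimal skeleton of that pipeline; everything named here is a placeholder (the topic, group, ZooKeeper address, and the xmlToJson helper, which stands in for the real xml2json + enrichment step):

import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

JavaPairDStream<String, String> xmlMessages = KafkaUtils.createStream(
    jssc, "zkhost:2181", "xml-group", Collections.singletonMap("xml-topic", 1));

JavaDStream<String> enriched = xmlMessages.map(
    new Function<Tuple2<String, String>, String>() {
      public String call(Tuple2<String, String> kv) {
        return xmlToJson(kv._2());  // hypothetical xml2json + enrichment helper
      }
    });

enriched.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
  public Void call(JavaRDD<String> rdd, Time time) {
    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
      public void call(Iterator<String> docs) {
        // open one MongoDB connection per partition and write the JSON documents
      }
    });
    return null;
  }
});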
Hello,
I want to override the log4j configuration when I start my Spark job.
I tried:
.../bin/spark-submit --class --conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/.../log4j.properties"
x.jar
or
.../bin/spark-submit --class --conf
"spark.executor.extraJavaOptio
Hello,
I'm evaluating Spark / Spark Streaming.
I use Spark Streaming to receive messages from a Kafka topic.
As soon as I have a JavaReceiverInputDStream, I have to treat each message;
for each one I have to search in MongoDB to find out whether a document exists.
If I find the document, I have to update
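A sketch of that lookup-then-update step with a recent MongoDB Java driver, where the upsert flag collapses the "exists? update : insert" logic into a single call (the database, collection, and field names are placeholders; the 2015-era 2.x driver spells this differently):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;

MongoCollection<Document> col = MongoClients.create("mongodb://localhost:27017")
    .getDatabase("events").getCollection("products");

// For each received message: update the product's document if it exists,
// insert it otherwise (upsert), keyed by the product ID from the message.
col.updateOne(
    Filters.eq("_id", productId),                           // productId from the message
    new Document("$set", new Document("picture", payload)), // payload from the message
    new UpdateOptions().upsert(true));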