On a single node, you can easily achieve tens of thousands of key-value
inserts per second. Depending on how many columns are in each row, 600
rows a second is rather slow :)
Your loop looks good. Using a single BatchWriter and letting it amortize
sending data from your client to the servers will be the most efficient.
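For concreteness, a minimal sketch of that pattern (the instance, table,
user, and tuning numbers below are placeholders, not recommendations):

import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class SingleWriterIngest {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
        .getConnector("ingestUser", new PasswordToken("secret"));

    // One BatchWriter for the whole ingest; it buffers mutations and sends
    // them to the tablet servers in background threads.
    BatchWriterConfig cfg = new BatchWriterConfig();
    cfg.setMaxMemory(64 * 1024 * 1024);     // buffer up to 64MB of mutations
    cfg.setMaxLatency(2, TimeUnit.MINUTES); // flush at least every 2 minutes
    cfg.setMaxWriteThreads(4);              // parallel sends to tablet servers

    BatchWriter writer = conn.createBatchWriter("myTable", cfg);
    try {
      // One mutation per JSON record; row = timestamp, columns = fields.
      Mutation m = new Mutation(new Text("20150324T120000"));
      m.put(new Text("data"), new Text("someField"), new Value("someValue".getBytes()));
      writer.addMutation(m);
    } finally {
      writer.close();  // close() flushes anything still buffered
    }
  }
}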
If the JSON parsing is the slowest part, you could have a single thread
read the file and hand each line to a thread pool; the workers parse the
lines and add the parsed objects to some concurrent data structure. A
consumer on that data structure then reads each parsed object and sends
it to Accumulo.
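A rough sketch of that pipeline, assuming a bounded queue between the
parsing pool and a single consumer that owns the BatchWriter (the thread
counts, queue size, file name, and parseToMutation are all illustrative):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.data.Mutation;
import org.apache.hadoop.io.Text;

public class ParallelParseIngest {

  // Sentinel used to tell the consumer that parsing is finished.
  private static final Mutation POISON = new Mutation(new Text("stop"));

  public static void ingest(BatchWriter writer) throws Exception {
    ExecutorService parsers = Executors.newFixedThreadPool(4);
    BlockingQueue<Mutation> queue = new ArrayBlockingQueue<>(10_000);

    // Consumer: the only thread that touches the BatchWriter.
    Thread consumer = new Thread(() -> {
      try {
        while (true) {
          Mutation m = queue.take();
          if (m == POISON) break;
          writer.addMutation(m);
        }
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    });
    consumer.start();

    // Producer: one thread reads the file and hands lines to the parsing pool.
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("records.json"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        final String json = line;
        parsers.submit(() -> {
          queue.put(parseToMutation(json)); // workers parse and enqueue
          return null;
        });
      }
    }

    parsers.shutdown();
    parsers.awaitTermination(1, TimeUnit.HOURS);
    queue.put(POISON);
    consumer.join();
    writer.flush();
  }

  static Mutation parseToMutation(String json) {
    // Parse the JSON line and build the Mutation; details omitted.
    return new Mutation(new Text("row"));
  }
}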
Alternatively, this is where MapReduce is a clear win, as it's very good
at parallelizing these types of problems. You could use a FileInputFormat
(such as TextInputFormat) with the AccumuloOutputFormat to accomplish this task.
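For that route, a map-only driver would look something like the sketch
below; the instance, table, and credential values are placeholders, and
the mapper's JSON parsing is elided:

import org.apache.accumulo.core.client.ClientConfiguration;
import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class JsonIngestJob {

  // Each mapper parses JSON lines and emits (table name, Mutation) pairs.
  public static class JsonMapper extends Mapper<LongWritable, Text, Text, Mutation> {
    private static final Text TABLE = new Text("myTable");

    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws java.io.IOException, InterruptedException {
      // Parse 'line' here; the row/column values below are placeholders.
      Mutation m = new Mutation(new Text("20150324T120000"));
      m.put(new Text("data"), new Text("someField"), new Value("someValue".getBytes()));
      context.write(TABLE, m);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "json-ingest");
    job.setJarByClass(JsonIngestJob.class);

    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    job.setMapperClass(JsonMapper.class);
    job.setNumReduceTasks(0);               // map-only ingest
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Mutation.class);

    job.setOutputFormatClass(AccumuloOutputFormat.class);
    AccumuloOutputFormat.setConnectorInfo(job, "ingestUser", new PasswordToken("secret"));
    AccumuloOutputFormat.setZooKeeperInstance(job,
        ClientConfiguration.loadDefault().withInstance("myInstance").withZkHosts("zkhost:2181"));
    AccumuloOutputFormat.setDefaultTableName(job, "myTable");
    AccumuloOutputFormat.setCreateTables(job, true);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}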
Andrea Leoni wrote:
Thank you for your answer.
Today I tried to create a big command file and push it to the shell (about 300k
inserts per file). As you said, it is too slow for me (about 600 inserted
rows/sec).
I've only been using Accumulo for a week. I'm a noob, but I'm learning.
My app has to store a large amount of data.
The row is the timestamp and the family/qualifier are the columns... I get my
data from a JSON file, so my app scans it for new records, parses each one, and
for each record creates a mutation and pushes it to Accumulo with a BatchWriter...
Maybe I'm doing something wrong, and fixing it could increase the speed of my inserts.
Currently I do:
LOOP
1) read a JSON line
2) parse it
3) create a mutation
4) put the line's information into the mutation
5) use the BatchWriter to insert the mutation into Accumulo
END LOOP
Is this all right? I know that steps 1) and 2) are slow, but they're necessary,
and I use the fastest JSON parser I've found online.
Thank you so much again!
(and sorry again for my bad English!)
-----
Andrea Leoni
Italy
Computer Engineering