Hi Christian,

Good day!
I have a question. I have loaded the 40 million row call dataset using your 
Erlang script, which fixed the enomem error. I am now trying to query it using:

$ time curl -X POST http://127.0.0.1:8098/mapred -H "Content-Type: application/json" -d @-
{"inputs":[["CustCalls40m","1"]],
 "query":[{"map":{"language":"javascript","name":"Riak.mapValues","keep":true}}]
}


curl: (7) couldn't connect to host

real    0m10.321s
user    0m0.002s
sys     0m0.002s

$ netstat -at | grep 8098
Nothing is returned.

I confirmed that the load completed using:
$ time ./load_datatest CustCalls40mnew.csv        >>40milli.txt &
[2] 8680
$
real    6376m15.918s  (= 106.27 hours = 4.43 days)
user    2392m21.438s
sys     4152m51.748s
$ tail -20 40milli.txt

Inserting: 39999989
Inserting: 39999990
Inserting: 39999991
Inserting: 39999992
Inserting: 39999993
Inserting: 39999994
Inserting: 39999995
Inserting: 39999996
Inserting: 39999997
Inserting: 39999998
Inserting: 39999999
Inserting: 40000000

I would appreciate your suggestions on how to correct this, Christian.

Thanks & regards
Sangeetha



-----Original Message-----
From: Christian Dahlqvist [mailto:[email protected]]
Sent: Wednesday, October 10, 2012 1:06 PM
To: Pattabi Raman, Sangeetha (Cognizant)
Cc: [email protected]; [email protected]
Subject: Re: riak memstore clarification on enomem error

On 10/10/2012 05:54, [email protected] wrote:
> Good day Christian!
> I have two questions and would appreciate your help with them. Thanks in 
> advance.
> 1. Riak took 7733m32.525s (about 5.4 days) to load the 35 million row data 
> set (1.8 GB), using a single curl against one node for storage. Is there a 
> way for the script below to use two curls, so that the second node in the 
> 2-node Riak cluster is also used, rather than only one node as now?
> 2. We deal with big data and I need to load up to 500 million rows (35 GB), 
> since my task involves comparing various loads on different databases 
> (Hadoop, MongoDB, Cassandra and now Riak). I have finished loading 500 
> million rows on Hadoop, MongoDB and Cassandra, but I am stuck on Riak since 
> loading takes so much time. Could you please help me make use of the second 
> node as well in your script below?
> Regards
> sangeetha
>
> -----Original Message-----
> From: Christian Dahlqvist [mailto:[email protected]]
> Sent: Tuesday, October 09, 2012 3:57 PM
> To: Pattabi Raman, Sangeetha (Cognizant)
> Cc: [email protected]; [email protected]
> Subject: Re: riak memstore clarification on enomem error
>
> On 09/10/2012 10:39, [email protected] wrote:
>> Thanks Shane,
>>
>> The load script used is as follows (basically a curl):
>>
>> #!/usr/local/bin/escript
>> main([Filename]) ->
>>       {ok, Data} = file:read_file(Filename),
>>       Lines = tl(re:split(Data, "\r?\n", [{return, binary},trim])),
>>       lists:foreach(fun(L) -> LS = re:split(L, ","), format_and_insert(LS) end, Lines).
>>
>> format_and_insert(Line) ->
>>       JSON = io_lib:format("{\"id\":\"~s\",\"phonenumber\":~s,\"callednumber\":~s,\"starttime\":~s,\"endtime\":~s,\"status\":~s}", Line),
>>       Command = io_lib:format("curl -X PUT http://127.0.0.1:8098/riak/CustCalls35m/~s -d '~s' -H 'content-type: application/json'", [hd(Line),JSON]),
>>       io:format("Inserting: ~s~n", [hd(Line)]),
>>       os:cmd(Command).
>>
>>
>> You are right, Shane. After loading, I confirmed it by querying the 35 
>> million row dataset (1.8 GB) for the first, middle and last rows (ids 1, 
>> 15000000 and 35000000) using the id column. This confirmed that the data is 
>> stored in the CustCalls35m bucket in Riak.
>> Regards
>> Sangeetha
>>
>>
>> -----Original Message-----
>> From: riak-users [mailto:[email protected]] On
>> Behalf Of Shane McEwan
>> Sent: Tuesday, October 09, 2012 3:00 PM
>> To: [email protected]
>> Subject: Re: riak memstore clarification on enomem error
>>
>> G'day Sangeetha.
>>
>> On 09/10/12 07:40, [email protected] wrote:
>>> Dear Team,
>>>
>>> I have 64 GB of RAM. While loading the 35 million row dataset (1.8 GB),
>>> about 40-45 GB of RAM is consumed during startup of the Erlang script.
>>> But while trying to load the 40 million row dataset (2.1 GB), I get the
>>> following error:
>>>
>>> escript: exception error: no match of right hand side value
>>> {error,enomem}
>> The error message is coming from escript and not Riak. It's just a guess but 
>> could it be that the script you're using to load your data into Riak is 
>> trying to load all the data into memory before sending it to Riak?
>> Can you break your dataset into smaller chunks and load them separately?
>> Or send the data to Riak as you read it from the dataset without storing it 
>> all in memory?
>>
>>> 2. Is there a provision to make use of swap memory in the Riak
>>> config?
>> Using swap in this situation is almost always a bad idea. Your script will 
>> end up running so slowly you will be waiting for days, maybe months, for 
>> your data to load.
>>
>> Shane.
>>
>> _______________________________________________
>> riak-users mailing list
>> [email protected]
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> This e-mail and any files transmitted with it are for the sole use of the 
>> intended recipient(s) and may contain confidential and privileged 
>> information. If you are not the intended recipient(s), please reply to the 
>> sender and destroy all copies of the original message. Any unauthorized 
>> review, use, disclosure, dissemination, forwarding, printing or copying of 
>> this email, and/or any action taken in reliance on the contents of this 
>> e-mail is strictly prohibited and may be unlawful.
>>
> Hi,
>
> That script does indeed load all lines into memory before processing them one 
> by one. Try something like this instead:
>
> #!/usr/local/bin/escript
> %% Reads the CSV file one line at a time instead of loading it all into memory.
> main([Filename]) ->
>       {ok, IoDev} = file:open(Filename, [read, raw, binary, {read_ahead, 65536}]),
>       process_file(IoDev).
>
> process_file(IoDev) ->
>       case file:read_line(IoDev) of
>           {ok, Data} ->
>               Line = strip_and_split(Data),
>               JSON = io_lib:format("{\"id\":\"~s\",\"phonenumber\":~s,\"callednumber\":~s,\"starttime\":~s,\"endtime\":~s,\"status\":~s}", Line),
>               Command = io_lib:format("curl -X PUT http://127.0.0.1:8098/riak/CustCalls35m/~s -d '~s' -H 'content-type: application/json'", [hd(Line),JSON]),
>               io:format("Inserting: ~s~n", [hd(Line)]),
>               os:cmd(Command),
>               process_file(IoDev);
>           eof ->
>               ok;
>           {error, Reason} ->
>               io:format("Error processing file: ~p~n", [Reason]),
>               error
>       end.
>
> %% Drops the trailing line ending (LF or CRLF), then splits on commas.
> strip_and_split(Line) ->
>       [L | _] = re:split(Line, "\r?\n"),
>       re:split(L, ",").
>
>
> Best Regards,
>
> Christian

Hi,

Using the HTTP interface through curl is a very easy way to load data into 
Riak, but not a very efficient one. It should nevertheless be quite easy to 
make it run in parallel by splitting the input file into multiple files and 
then running the script once for each file. This way you can even split the 
load across several nodes in the cluster.
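
A minimal sketch of that splitting approach, assuming a POSIX shell with the 
standard split and wc utilities. The function names split_input and 
load_chunks and the chunk prefix are made up for illustration; they are not 
part of Riak or of the script above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Riak or of the loader script.

# Split INPUT into at most PARTS files of whole lines,
# named PREFIX.aa, PREFIX.ab, and so on.
split_input() {
    input=$1
    parts=$2
    prefix=$3
    total=$(wc -l < "$input")
    # Round up so that no more than $parts chunks are produced.
    per_chunk=$(( (total + parts - 1) / parts ))
    [ "$per_chunk" -gt 0 ] || per_chunk=1
    split -l "$per_chunk" "$input" "$prefix."
}

# Start LOADER once per chunk, all in the background, writing each
# loader's output to CHUNK.log, then wait until every loader has exited.
load_chunks() {
    loader=$1
    prefix=$2
    # The ?? pattern matches only the two-letter suffixes that split
    # creates, so the .log files are never picked up as input.
    for chunk in "$prefix".??; do
        "$loader" "$chunk" > "$chunk.log" 2>&1 &
    done
    wait
}
```

For example, "split_input CustCalls40mnew.csv 4 chunk" followed by 
"load_chunks ./load_datatest chunk" would run four loaders at once. Since the 
script hardcodes 127.0.0.1:8098, spreading the load over several nodes would 
presumably mean running some of the chunks on each node, or changing the URL 
in a copy of the script.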

If the performance is still not adequate, or large volumes of data need to be 
loaded on a regular basis, I would instead recommend developing a loading 
script/solution based on one of the protobuf clients (these are available for 
a large number of languages, e.g. Java and Ruby). This should give 
considerably better performance, especially if using multiple connections in 
parallel.

Best regards,

Christian


