[ 
https://issues.apache.org/jira/browse/AVRO-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724369#action_12724369
 ] 

Raghu Angadi commented on AVRO-24:
----------------------------------


Thanks Doug. I agree benchmarking first would be the way to go. 

I would not be too concerned about minimizing the copies (at least 
initially)... CPU is usually not the bottleneck. When the Datanode's CPU usage 
while serving data went down nearly 10 times between 0.16 and 0.18, hardly 
anyone noticed (the same when it doubled between 0.13 and 0.14).

Some of the cases I am interested in with bulk transfers (these might not be 
addressed here, but benchmarks help in assessing them):

   * How streaming reads are handled. Depending on the frame size, the client 
side might not need pipelining. Reading 64KB frames might be enough to mask a 
1ms latency between frames. Otherwise it might need to pipeline multiple frames.
   * Connection management: do multiple simultaneous transfers use different 
connections? Do normal RPCs share the same connection? Hadoop does not yet have 
a case with a lot of bulk and non-bulk RPCs at the same time.
   * Server side (Datanode): is the disk data fetched inside the RPC handler? 
Since data is written directly to the client socket, will a slow client hold 
the handler?
      ** Does one slower disk inhibit serving data from other disks? A larger 
number of handlers can mask this.
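
For the first bullet, the frame-size question can be roughed out before any 
benchmark is run. The sketch below is only a back-of-the-envelope model, not 
Avro or Hadoop code: the class and method names are hypothetical, the 1 Gbit/s 
link speed is an assumed figure, and the "window" formula (frames in flight 
divided by the time per frame, capped at the link rate) is a simplification 
that ignores disk and CPU time entirely.

```java
// Hypothetical model of framed reads; not part of Avro or Hadoop.
final class FrameThroughput {

    /**
     * Bytes per second when the client asks for one frame at a time and
     * waits gapSec between the end of one frame and the start of the next.
     */
    static double stopAndWait(long frameBytes, double linkBytesPerSec, double gapSec) {
        double transferSec = frameBytes / linkBytesPerSec;
        return frameBytes / (transferSec + gapSec);
    }

    /**
     * Bytes per second with `depth` frames requested ahead. The gap is
     * overlapped with other frames' transfers, so throughput scales with
     * depth until the link itself becomes the limit.
     */
    static double pipelined(long frameBytes, double linkBytesPerSec, double gapSec, int depth) {
        return Math.min(linkBytesPerSec, depth * stopAndWait(frameBytes, linkBytesPerSec, gapSec));
    }

    public static void main(String[] args) {
        long frame = 64 * 1024;        // 64KB frames, as in the bullet above
        double link = 125_000_000.0;   // ASSUMPTION: 1 Gbit/s link, in bytes/sec
        double gap = 0.001;            // 1ms latency between frames

        System.out.printf("no pipelining: %.1f MB/s%n", stopAndWait(frame, link, gap) / 1e6);
        for (int depth = 2; depth <= 4; depth++) {
            System.out.printf("depth %d:       %.1f MB/s%n",
                    depth, pipelined(frame, link, gap, depth) / 1e6);
        }
    }
}
```

Under these assumptions, whether 64KB frames are "enough" depends on what has 
to be saturated: the unpipelined rate can never exceed frame size divided by 
the gap (about 65 MB/s here), so a faster disk or link would need either 
larger frames or a few frames in flight. Real numbers should come from the 
benchmark.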


   

> benchmark bulk data
> -------------------
>
>                 Key: AVRO-24
>                 URL: https://issues.apache.org/jira/browse/AVRO-24
>             Project: Avro
>          Issue Type: Task
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.0.0
>
>
> It would be good to validate that the RPC wire format is capable of 
> transmitting bulk data efficiently.  In particular, to be used for HDFS file 
> access, it must be able to, when including file data in an RPC response, or 
> writing file data in an RPC request:
>  - saturate a disk's throughput or a network interface; and
>  - not consume much CPU.
> In other words, Avro's RPC should not be a bottleneck in the transfer of file 
> data from a remote disk to an application or vice versa, and moreover it 
> should leave the vast majority of the CPU for the application.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
