[jira] Commented: (HADOOP-2657) Enhancements to DFSClient to support flushing data at any point in time

Raghu Angadi (JIRA) Thu, 06 Mar 2008 14:52:17 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575936#action_12575936 ]


Raghu Angadi commented on HADOOP-2657:
--------------------------------------


# The patch needs to be updated for trunk.
# Why isn't {{FSOutputSummer.flushBuffer()}} just {{flushBuffer(false);}}?
# I am not sure why only the latter does {{count = chunkLen;}}.
# In DFSClient: {{flush()}} sets {{closed}} to true without the cleanup done in 
{{closeInternal()}}. Should it invoke {{closeInternal()}} instead?
# I don't think I followed everything thoroughly. I will chat with you regarding 
the specifics if required.

General thought: The flush implemented here looks very much like fsync() to 
me. That is why we have the extra RPC cost if the user flushes data just before 
closing. This even invokes namenode.fsync(). Waiting for acks from the datanodes 
is another thing that makes it behave like fsync(). Obviously there is nothing 
wrong with the extra guarantees; it is just more than what a user might want and 
expect when they invoke flush(). These extra guarantees usually tend to have 
extra costs and might limit (now or in the future) the primary advantages of 
HDFS: scalability, throughput, and reliability. I would just flush the data to 
the socket and not wait for anything else. In contrast, fsync() tends to be used 
much less frequently because users know it would be costly.
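
To illustrate the flush-versus-fsync distinction, here is a minimal sketch using only the standard java.io classes on a local file. It is an analogy, not the DFSClient code from the patch: the class name {{FlushVsSync}} and both method names are made up for illustration. flush() only hands the buffered bytes to the layer below, while sync() additionally blocks until that layer reports the data as durable, which is where the extra cost comes from.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical illustration class; not part of Hadoop or of the patch.
final class FlushVsSync {

    /** Cheap path: push buffered bytes down to the OS and return without waiting. */
    static long writeAndFlush(File f, byte[] data) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(f);
             BufferedOutputStream out = new BufferedOutputStream(fos, 64 * 1024)) {
            out.write(data);
            out.flush();              // empties the user-space buffer only
            return f.length();        // bytes are visible, but not guaranteed durable
        }
    }

    /** Costly path: flush, then block until the OS confirms the data reached the device. */
    static long writeAndSync(File f, byte[] data) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(f);
             BufferedOutputStream out = new BufferedOutputStream(fos, 64 * 1024)) {
            out.write(data);
            out.flush();              // must flush first, or sync() sees nothing new
            fos.getFD().sync();       // fsync-like: waits for the storage layer
            return f.length();
        }
    }
}
```

In the HDFS case the analogue of the second method would be waiting for datanode acks and calling namenode.fsync(); the first method corresponds to just writing to the socket.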


> Enhancements to DFSClient to support flushing data at any point in time
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-2657
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2657
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: flush.patch, flush2.patch, flush3.patch, flush4.patch
>
>
> The HDFS Append Design (HADOOP-1700) requires that there be a public API to 
> flush data written to an HDFS file that can be invoked by an application. This 
> API (popularly referred to as fflush(OutputStream)) will ensure that data 
> written to the DFSOutputStream is flushed to datanodes and any required 
> metadata is persisted on the Namenode.
> This API has to handle the case when the client decides to flush after 
> writing data that is not an exact multiple of io.bytes.per.checksum.
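
The partial-chunk case in the description can be sketched as follows. This is only an illustration of per-chunk checksumming with a short trailing chunk, assuming CRC32 and a hypothetical helper class {{ChunkChecksums}}; it is not the FSOutputSummer implementation. The {{bytesPerChecksum}} parameter stands in for io.bytes.per.checksum.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical helper for illustration; not Hadoop's FSOutputSummer.
final class ChunkChecksums {

    /** Computes one CRC32 per chunk; the last chunk may be shorter than bytesPerChecksum. */
    static List<Long> checksums(byte[] data, int len, int bytesPerChecksum) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < len; off += bytesPerChecksum) {
            int n = Math.min(bytesPerChecksum, len - off); // short final chunk on flush
            CRC32 crc = new CRC32();
            crc.update(data, off, n);
            sums.add(crc.getValue());
        }
        return sums;
    }

    /** Bytes in the trailing partial chunk, or 0 if len is an exact multiple. */
    static int partialChunkLength(int len, int bytesPerChecksum) {
        return len % bytesPerChecksum;
    }
}
```

For example, flushing after 1300 bytes with 512 bytes per checksum yields three checksums, the last covering only 276 bytes. If the client keeps writing, that last checksum has to be recomputed once the chunk grows, which is the case the API must handle.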

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
