2011/9/8 Kanghua151 <kanghua...@msn.com>:
> You are so nice, thank you very much :)
> Last question:
> Can I trigger block sync without restarting hdfs?
Close the file or have a machine crash :) But no, not really.

>
> Sent from my iPhone
>
> On 2011-9-8, at 15:00, Todd Lipcon <t...@cloudera.com> wrote:
>
>> 2011/9/7 kang hua <kanghua...@msn.com>:
>>> Thanks my friend!
>>> Please allow me to ask more questions about the details!
>>> 1. Yes, I can use hadoop fs -tail or -cat xxx to see that file's content. But
>>> how can I get that file's real size from another process if the namenode has
>>> not changed? What I really want is to read the data at the tail of that file.
>>
>> You can open the file and then use an API on the DFSInputStream class
>> to find the length. I don't recall the name of the API, but if you
>> look in there, you should see it.
>>
>>> 2. Why is it that "when I reboot hdfs, I can see that file's content that I
>>> flushed" again by "hadoop fs -ls xxx"?
>>
>> On restart, the namenode triggers block synchronization, and the
>> up-to-date length is determined.
>>
>>> 3. In append mode: if I close the file and open it in append mode again and
>>> again, the real data space increases normally, but the namenode shows dfs used
>>> space increasing too fast. Is it a bug?
>>
>> Might be a bug, yes.
>>
>>> 4. In which version of hdfs does append have no bugs?
>>
>> 0.21, which is buggy in other aspects. So, no stable released version
>> has a working append() call.
>>
>> In truth I've never seen a _good_ use case for
>> append-to-an-existing-file. Usually you can do just as well by keeping
>> the file open and periodically hflushing, or rolling to a new file
>> when you want to add more records to an existing dataset.
>>
>> -Todd
>>
>>>> From: t...@cloudera.com
>>>> Date: Wed, 7 Sep 2011 14:17:10 -0700
>>>> Subject: Re: Question about hdfs close & hflush behavior
>>>> To: hdfs-user@hadoop.apache.org
>>>>
>>>> 2011/9/7 kang hua <kanghua...@msn.com>:
>>>>>
>>>>> Hi friends:
>>>>> I have two questions.
>>>>> The first one is:
>>>>> I use libhdfs's hflush to flush my data to a file; in the same process
>>>>> context I can read it. But I find the file unchanged if I check from the
>>>>> hadoop shell ---- its length is zero (checked by "hadoop fs -ls xxx" or by
>>>>> reading it in a program); however, when I reboot hdfs, I can read the file's
>>>>> content that I flushed. Why?
>>>>
>>>> If we were to update the file metadata on hflush, it would be very
>>>> expensive, since the metadata lives in the NameNode.
>>>>
>>>> If you do hadoop fs -cat xxx, you should see the entirety of the flushed
>>>> data.
>>>>
>>>>> Can I hflush data to a file without closing it, and at the same time read
>>>>> the data flushed by another process?
>>>>
>>>> Yes.
>>>>
>>>>> The second one is:
>>>>> Once an hdfs file is closed, is the last written block left untouched? Even
>>>>> if I open the file in append mode, will the namenode allocate a new block
>>>>> for the appended data?
>>>>
>>>> No, it reopens the last block of the existing file for append.
>>>>
>>>>> I find that if I close a file and open it in append mode again and again,
>>>>> the hdfs report will show "used space much more than the file's logical size".
>>>>
>>>> Not sure I follow what you mean by this. Can you give more detail?
>>>>
>>>>> btw: I use cloudera ch2
>>>>
>>>> The actual "append()" function has some bugs in all of the 0.20
>>>> releases, including Cloudera's. The hflush/sync() API is fine to use,
>>>> but I would recommend against using append().
>>>>
>>>> -Todd
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
--
Todd Lipcon
Software Engineer, Cloudera
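
For reference, a minimal sketch of the visible-length lookup Todd describes
above: open the file and ask the HDFS input stream for the length readable
right now, including hflushed data in the still-open last block. Todd doesn't
recall the exact method name; the cast target and getVisibleLength() below are
assumptions about the 0.20-append-era client, so check your DFSInputStream
source. The cluster URI and path are made up.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DFSClient;

    public class VisibleLength {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed cluster URI; substitute your fs.default.name.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

        // FileStatus.getLen() from the NameNode lags behind hflushed data,
        // because the last block is still under construction. Opening the
        // file reaches the DataNodes, which do see the hflushed bytes.
        FSDataInputStream in = fs.open(new Path("/logs/app.log"));
        try {
          // On 0.20-era HDFS, open() on a DistributedFileSystem returns a
          // DFSClient.DFSDataInputStream wrapping DFSInputStream.
          // getVisibleLength() is the method name assumed here.
          DFSClient.DFSDataInputStream din = (DFSClient.DFSDataInputStream) in;
          System.out.println("visible length = " + din.getVisibleLength());
        } finally {
          in.close();
        }
      }
    }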
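
And a sketch of the pattern Todd recommends instead of append(): keep the
writer open, flush periodically, and roll to a new file when you want to add
more records. The paths, roll threshold, and record source are made up for
illustration; on 0.20 the flush call is sync(), hflush() being the 0.21+ name.

    import java.net.URI;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RollingWriter {
      // Arbitrary roll threshold, chosen for illustration.
      private static final long ROLL_BYTES = 64L * 1024 * 1024;

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

        int part = 0;
        FSDataOutputStream out = fs.create(new Path("/data/records." + part));
        for (String record : Arrays.asList("rec1", "rec2", "rec3")) { // stand-in records
          out.write((record + "\n").getBytes("UTF-8"));
          // Make the data visible to readers without closing the file.
          // On 0.20 this is sync(); hflush() replaces it in 0.21+.
          out.sync();
          if (out.getPos() >= ROLL_BYTES) {
            // Instead of reopening with append(), close and start a new part.
            out.close();
            part++;
            out = fs.create(new Path("/data/records." + part));
          }
        }
        out.close();
      }
    }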