[
https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780367#action_12780367
]
BitsOfInfo commented on MAPREDUCE-1176:
---------------------------------------
>>>Why can't you just keep defaultSize and recordLength as longs?
Because findbugs threw warnings when they were not cast; secondly, the code
works as expected. Please just send over how you want that calculation
rewritten and I can certainly change it.
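For what it's worth, here is one way that calculation could be rewritten with plain longs and no casts (a sketch only; the class and method names are mine, not the patch's):

```java
public class SplitSizeSketch {
    // Largest multiple of recordLength that is <= defaultSize.
    // Long division already floors, so no cast to double and no
    // Math.floor is needed, which should also quiet findbugs.
    static long adjustSplitSize(long defaultSize, long recordLength) {
        return (defaultSize / recordLength) * recordLength;
    }

    public static void main(String[] args) {
        // e.g. a 64 MB default split size with 100-byte records
        System.out.println(adjustSplitSize(67108864L, 100L)); // 67108800
    }
}
```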
>>>- In isSplitable, you catch the exception generated by getRecordLength and
>>>turn off splitting.
>>> If there is no record length specified doesn't that mean the input format
>>> won't work at all?
Nope, it would still work; I have yet to see a raw data file of fixed-width
records that for some reason contains incomplete records. But that's fine, we
can just exit out here to let the user know they need to configure that
property. If there is a better place to check for the existence of that
property, please let me know.
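If it helps, the check could be done as an up-front validation rather than catching an exception in isSplitable (a sketch; the helper class and method names are hypothetical):

```java
public class RecordLengthCheck {
    // Hypothetical helper: fail fast with a clear message when the
    // record length property is missing or non-positive, instead of
    // catching an exception inside isSplitable().
    static int validateRecordLength(int configured) {
        if (configured <= 0) {
            throw new IllegalArgumentException(
                "mapreduce.input.fixedlengthinputformat.record.length "
                + "must be set to a positive integer, got: " + configured);
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(validateRecordLength(100)); // 100
    }
}
```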
>>>- FixedLengthRecordReader: "This record reader does not support compressed
>>>files." Is this true?
Correct, as stated in the docs. The reason is that when I wrote this I was not
dealing with compressed files. Secondly, if an input file were compressed, I
was not sure of the procedure to properly compute the splits against it, since
the byte lengths of the records would differ in compressed form vs. what is
passed to the RecordReader.
>>>- Throughout, you've still got 4-space indentation in the method bodies.
>>>Indentation should be by 2.
Does anyone know of an automated tool that will fix this? It is driving me
nuts going line by line and hitting delete twice... When I look at this in
Eclipse I am not seeing 4 spaces.
>>>- In FixedLengthRecordReader, you hard code a 64KB buffer. Why's this? You
>>>should let the filesystem use its default.
Sure, I can get rid of that.
>>>- In your read loop, you're not accounting for the case of read returning 0
>>>or -1, which I believe
>>> can happen at EOF, right? Consider using o.a.h.io.IOUtils.readFully() to
>>> replace this loop.
Ditto, I can change to that.
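For reference, the loop o.a.h.io.IOUtils.readFully() performs is roughly the following plain-Java sketch, which handles the -1 (EOF) return the current loop misses:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullySketch {
    // Roughly what o.a.h.io.IOUtils.readFully does: keep reading until
    // len bytes have arrived, and turn a -1 (EOF) return into an
    // exception rather than silently yielding a short record.
    static void readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        while (len > 0) {
            int n = in.read(buf, off, len);
            if (n < 0) {
                throw new EOFException("Premature EOF, " + len + " bytes short");
            }
            off += n;
            len -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] record = new byte[5];
        readFully(new ByteArrayInputStream("hello".getBytes()), record, 0, 5);
        System.out.println(new String(record)); // hello
    }
}
```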
>>>As a general note, I'm not sure I agree with the design here. Rather than
>>>forcing the split to lie on record boundaries,
OK, that's fine; I just wanted to contribute what I wrote, which is working
for my case.
>>> open the record reader, skip forward to the next record boundary
Hmm, OK. Do you have a suggestion on how I detect where one record begins and
another ends when records are not identifiable by any sort of consistent
"start" or "end" boundary character, but just flow together? I could see the
RecordReader detecting that it read fewer than RECORD LENGTH bytes upon
hitting the end of the split, and discarding that partial record. But I am not
sure how it would detect the start of a record in a split that has partial
data at its start, especially if there is no consistent boundary/marker
character that identifies the start of a record.
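One observation on this point: with fixed-length records there is no marker character to search for, but there may be no need for one either, since record boundaries fall at absolute file offsets that are exact multiples of the record length. If that is what the reviewer's suggestion assumes, a record reader could compute the first boundary at or after its split start arithmetically (a sketch; the names are mine):

```java
public class BoundarySketch {
    // With fixed-length records, boundaries sit at absolute file offsets
    // that are multiples of recordLength, so the first record start at or
    // after a split's start offset can be computed, no marker needed.
    static long firstRecordStartAtOrAfter(long splitStart, long recordLength) {
        long rem = splitStart % recordLength;
        return rem == 0 ? splitStart : splitStart + (recordLength - rem);
    }

    public static void main(String[] args) {
        System.out.println(firstRecordStartAtOrAfter(250L, 100L)); // 300
        System.out.println(firstRecordStartAtOrAfter(300L, 100L)); // 300
    }
}
```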
> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-1176
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Affects Versions: 0.20.1, 0.20.2
> Environment: Any
> Reporter: BitsOfInfo
> Priority: Minor
> Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch
>
>
> Hello,
> I would like to contribute the following two classes for incorporation into
> the mapreduce.lib.input package. These two classes can be used when you need
> to read data from files containing fixed length (fixed width) records. Such
> files have no CR/LF (or any combination thereof) and no delimiters; each
> record is a fixed length, and extra space is padded with blanks. The data is
> one gigantic line within a file.
> Provided are two classes: the first is FixedLengthInputFormat, and the second
> is its corresponding FixedLengthRecordReader. When creating a job that
> specifies this input format, the job must have the
> "mapreduce.input.fixedlengthinputformat.record.length" property set, as follows:
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",[myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH,
> [myFixedRecordLength]);
> This input format overrides computeSplitSize() in order to ensure that
> InputSplits do not contain any partial records, since with fixed-length
> records there is no way to determine where a record begins if that were to
> occur. Each InputSplit passed to the FixedLengthRecordReader will start at the
> beginning of a record, and the last byte in the InputSplit will be the last
> byte of a record. The override of computeSplitSize() delegates to
> FileInputFormat's compute method, and then adjusts the returned split size as
> follows: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength)
> * fixedRecordLength)
> This suite of fixed-length input format classes does not support compressed
> files.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.