[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated MAPREDUCE-1176:
-------------------------------------

    Status: Open  (was: Patch Available)

* The comments are so dense that they make the code hard to read. Getting the 
right mix is a balance, but this sort of annotation is just noise:
{noformat}
+    // fetch configuration
+       Configuration conf = job.getConfiguration();
[snip]
+    // if the currentPosition is less than the split end..
+    if (currentPosition < splitEnd) {
[snip]
+  // reference to the input stream
+  private FSDataInputStream fileInputStream;
{noformat}
Please limit comments to only those sections (and user-visible javadoc) that 
require non-local context.
* The utility {{toBytes}} can be replaced with {{DataOutputBuffer::writeLong}}
* The {{BytesWritable}} key/value instances can be initialized in the 
{{initialize}} method, rather than lazily in {{nextKeyValue}} (both points are 
sketched below)
* While splitting compressed files is not meaningful to this reader in general, 
as long as each split is one file (and the fixed record size is enforced by the 
reader), it need not be illegal to use it on compressed files
* {{recordKeyEndAt}} seems mostly unused. Aren't the key start and length 
sufficient?
* Is there any advantage to setting the fixed attributes separately? Would a 
single method that performs the relevant boundary checks and sets all the 
record attributes be sufficient? (A possible shape is sketched below.)
* The unit test includes {{!}} to confirm it has found a record boundary, but 
that character is also included in the random charset, so false positives are 
possible (albeit unlikely)
* Instead of random data, validating that each record is composed of 
deterministic key/record data would be a more complete test (e.g. 
{{^VVVVKKKKVVVVV$}} for all records; sketched below). If you want to include a 
random test, varying the key start and length would work.
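
A minimal sketch of the {{toBytes}}/{{DataOutputBuffer}} and {{initialize}} 
points, assuming the key is the record's byte offset; the class and field 
names here are illustrative, not taken from the patch:
{noformat}
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.DataOutputBuffer;

public class FixedLengthKeyEncoding {
  // Reusable instances, created once (e.g. in the reader's initialize()),
  // not lazily in nextKeyValue().
  private final BytesWritable key = new BytesWritable();
  private final DataOutputBuffer keyBuffer = new DataOutputBuffer();

  /** Encode the record's byte offset as the BytesWritable key. */
  public BytesWritable encodeKey(long recordOffset) throws IOException {
    keyBuffer.reset();                    // reuse the buffer between records
    keyBuffer.writeLong(recordOffset);    // replaces the hand-rolled toBytes()
    key.set(keyBuffer.getData(), 0, keyBuffer.getLength());
    return key;
  }
}
{noformat}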
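
A possible shape for a single attribute-setting method with the boundary 
checks (the attribute names are assumptions, not the patch's):
{noformat}
private int recordLength;
private int keyStart;
private int keyLength;

/** Set all fixed-record attributes in one place, with boundary checks. */
public void setRecordAttributes(int recordLength, int keyStart, int keyLength) {
  if (recordLength <= 0) {
    throw new IllegalArgumentException("record length must be positive");
  }
  if (keyStart < 0 || keyLength < 0 || keyStart + keyLength > recordLength) {
    throw new IllegalArgumentException("key must lie within the record");
  }
  this.recordLength = recordLength;
  this.keyStart = keyStart;
  this.keyLength = keyLength;
}
{noformat}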
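
For the deterministic-data test, one way to build records of the 
{{^VVVVKKKKVVVVV$}} form (again a sketch; it assumes the record index fits in 
{{keyLength}} digits and the key lies strictly inside the record):
{noformat}
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

class FixedLengthTestData {
  /** Build one fixed-length record: '^', value filler, zero-padded key, '$'. */
  static byte[] buildRecord(int recordIndex, int recordLength,
                            int keyStart, int keyLength)
      throws UnsupportedEncodingException {
    byte[] record = new byte[recordLength];
    Arrays.fill(record, (byte) 'V');            // value filler
    record[0] = (byte) '^';                     // record start marker
    record[recordLength - 1] = (byte) '$';      // record end marker
    byte[] keyDigits =
        String.format("%0" + keyLength + "d", recordIndex).getBytes("US-ASCII");
    System.arraycopy(keyDigits, 0, record, keyStart, keyLength);
    return record;
  }
}
{noformat}
Validation is then the inverse: for each record the reader returns, check the 
markers, the filler bytes, and that the key matches the expected record index.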

> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1176
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.20.2, 0.20.1
>         Environment: Any
>            Reporter: BitsOfInfo
>            Priority: Minor
>         Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch, 
> MAPREDUCE-1176-v3.patch, MAPREDUCE-1176-v4.patch
>
>
> Hello,
> I would like to contribute the following two classes for incorporation into 
> the mapreduce.lib.input package. These two classes can be used when you need 
> to read data from files containing fixed length (fixed width) records. Such 
> files have no CR/LF (or any combination thereof) and no delimiters, but each 
> record is a fixed length, and extra data is padded with spaces. The data is 
> one gigantic line within a file.
> Provided are two classes: FixedLengthInputFormat and its 
> corresponding FixedLengthRecordReader. When creating a job that specifies 
> this input format, the job must have the 
> "mapreduce.input.fixedlengthinputformat.record.length" property set as follows
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",[myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, 
> [myFixedRecordLength]);
> This input format overrides computeSplitSize() in order to ensure that 
> InputSplits do not contain any partial records since with fixed records there 
> is no way to determine where a record begins if that were to occur. Each 
> InputSplit passed to the FixedLengthRecordReader will start at the beginning 
> of a record, and the last byte in the InputSplit will be the last byte of a 
> record. The override of computeSplitSize() delegates to FileInputFormat's 
> compute method, and then adjusts the returned split size by doing the 
> following: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) 
> * fixedRecordLength)
> This suite of fixed length input format classes does not support compressed 
> files. 
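
The split-size adjustment described in the quoted description amounts to the 
following (a sketch of the arithmetic only, not the patch's actual code):
{noformat}
// Round the split size computed by FileInputFormat down to a whole number
// of fixed-length records, so no InputSplit ends mid-record.
static long adjustSplitSize(long computedSplitSize, long fixedRecordLength) {
  // Note: a real implementation also has to handle the case where the
  // computed size is smaller than one record.
  return (computedSplitSize / fixedRecordLength) * fixedRecordLength;
}
{noformat}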

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
