Hi Group,
Thanks for the effort placed into making Hadoop available, and for the tips
and suggestions posted to this mailing list.
I'm currently looking into using Hadoop and, along the way, writing a Ruby DSL
for Cascading (www.cascading.org).
Some homework has generated two related ideas. I'll raise the other in a
separate email.
I would like to get some feedback on these ideas and find out how best to go
about making an enhancement request.
Browsing the email archive, it seems several users have a use case not
dissimilar from the one I have in mind. Alternatively, they have been advised
to 'formulate' their data/problem in a way that would fit the following
InputFormat ideas.
As the use cases cited below show, I'm not suggesting that workarounds and
custom InputFormats are impossible. Rather, these cases indicate that I'm not
alone in thinking along these lines, and it seems custom formats are
repeatedly being written when one 'core' format might suffice.
I have another use case where this InputFormat would be
critical/important/useful. I'll raise this in a separate email.
Proposal:
Introduce, for want of a better term, a CanonicalInputFormat (CIF).
Canonical in the sense that the input file contents conform to the key-value idea.
Description:
The CIF takes two files:
- Index file (similar to a key)
- Data file (similar to a value)
Index File:
- Text
- Tab-delimited (default) fields
- One record is a single line (default).
- The only requirement is that _somewhere_ among the record's fields are the
start and end positions (in separate fields) of the data-chunk in the data
file that corresponds to the record.
- Each record/line produces map input with the complete index line/record as the
key (the value is a data-record parsed from the data-chunk the record points to).
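To make the index-file idea concrete, here is a minimal Ruby sketch; the helper name and the field positions holding the start/end values are illustrative assumptions, not a proposed API:

```ruby
# Hypothetical helper: parse one tab-delimited index record and pull the
# data-chunk start/end positions from user-chosen field indices.
# The field indices (1 and 2 here) are assumptions for illustration.
def parse_index_line(line, start_field: 1, end_field: 2, sep: "\t")
  key    = line.chomp                      # the whole line becomes the key
  fields = key.split(sep)
  [key, Integer(fields[start_field]), Integer(fields[end_field])]
end

key, start_pos, end_pos = parse_index_line("fileA\t100\t250\tsome-metadata")
# key => "fileA\t100\t250\tsome-metadata", start_pos => 100, end_pos => 250
```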
Data File:
- Can be a binary file
  - Fixed-length data-records within each data-chunk
  - Alternatively, the whole data-chunk is one data-record
- Can be a text file
  - Variable-length data-records
    - Required fixed number of delimited fields, optional data-record delimiter
    - Alternatively, required record delimiter
  - Fixed-length data-records
    - Optional data-record delimiter
- Data-record delimiters are trimmed before passing the data-record on.
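A rough Ruby sketch of the record-splitting rules above (the helper name and defaults are my own, not part of the proposal):

```ruby
# Hypothetical helper: split one data-chunk into data-records.
# Supports fixed-length records, a record delimiter (which String#split
# trims, matching the trimming rule above), or whole-chunk-as-one-record.
def chunk_to_records(chunk, record_length: nil, record_sep: nil)
  if record_length
    chunk.scan(/.{1,#{record_length}}/m)   # fixed-length slices
  elsif record_sep
    chunk.split(record_sep)                # delimiter is trimmed by split
  else
    [chunk]                                # whole data-chunk is one record
  end
end

chunk_to_records("aabbcc", record_length: 2)     # => ["aa", "bb", "cc"]
chunk_to_records("r1||r2||r3", record_sep: "||") # => ["r1", "r2", "r3"]
```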
The CIF InputSplit respects the data-chunk's begin and end positions,
obtained from the index file entry.
Only one data-chunk is passed to the RecordReader.
Essentially, the data file is still being split, but in a more user-defined way.
[Maybe this needs to allow Hadoop to further split the data-chunk?]
The CIF RecordReader takes user-defined format information and then parses
the data-chunk provided by the InputSplit.
One could adopt format descriptions such as:
http://www.rubycentral.com/book/ref_c_string.html#String.unpack
The JRuby project may have a Java library that implements this unpack.
Alternatively,
http://home.mho.net/jswaby/fb_doc.html
or some other 'best of breed' format definition/syntax.
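For what it's worth, Ruby's String#unpack already implements a compact format language of this kind; a quick illustration (the format string is chosen arbitrarily):

```ruby
# Build a record consisting of a 5-byte ASCII field followed by a
# big-endian 16-bit integer, then parse it back with an unpack spec:
# "a5" = 5 ASCII bytes, "n" = 16-bit big-endian unsigned integer.
record = "hello" + [258].pack("n")
fields = record.unpack("a5n")
# fields => ["hello", 258]
```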
The Map-Reduce Job is finally passed N key-value pairs:
- Key: the complete line read from the index file.
- Value: one data-record parsed from the data-chunk.
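Putting the pieces together, a toy Ruby sketch of the whole flow, with StringIO standing in for the data file; the field positions and fixed record length are assumptions for illustration:

```ruby
require "stringio"

# Hypothetical end-to-end sketch: one index line plus the data file
# yields N key-value pairs, all keyed by the complete index line.
def emit_pairs(index_line, data_io, record_length:)
  key    = index_line.chomp
  fields = key.split("\t")
  start_pos, end_pos = Integer(fields[1]), Integer(fields[2]) # assumed field positions
  data_io.seek(start_pos)
  chunk = data_io.read(end_pos - start_pos)
  chunk.scan(/.{1,#{record_length}}/m).map { |rec| [key, rec] }
end

data  = StringIO.new("XXXXaabbccYYYY")  # chunk "aabbcc" at byte offsets 4..10
pairs = emit_pairs("d\t4\t10", data, record_length: 2)
# pairs => [["d\t4\t10", "aa"], ["d\t4\t10", "bb"], ["d\t4\t10", "cc"]]
```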
OutputFormat:
It would be useful to have a corresponding output format.
Illustration:
From the streaming guide's 'Field Selection' example:
http://hadoop.apache.org/core/docs/r0.18.1/streaming.html#More+usage+examples
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce\
-reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce\
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-jobconf map.output.key.field.separator=. \
-jobconf num.key.fields.for.partition=2 \
-jobconf mapred.data.field.separator=. \
-jobconf map.output.key.value.fields.spec=6,5,1-3:0- \
-jobconf reduce.output.key.value.fields.spec=0-2:5- \
-jobconf mapred.reduce.tasks=12
## Proposed additions ##
-indexfile myIndexFile
-datafile myDataFile
-jobconf map.input.index.chunks.field.separator=, \ # allow custom
-jobconf map.input.index.chunks.record.separator=\n \ # allow custom formats
-jobconf map.input.index.chunks.data.position.spec=2,7 \ # any 2 index fields
-jobconf map.input.data.chunks.splittable=false \ # true
-jobconf map.input.data.chunks.spec=binary \ # text, binary, etc.
-jobconf map.input.data.chunks.multi.records=true \ # false if data-chunk is one record
-jobconf map.input.data.chunks.record.spec=ax2X2aX2a \ # or some such format; only valid if multi.records is true
For text files (variable-length records):
-jobconf map.input.data.chunk.spec=text \ # (t)text, (b)binary, etc.
-jobconf map.input.data.chunk.field.separator=\t \ # required
-jobconf map.input.data.chunk.field.count=5 \ # required
-jobconf map.input.data.chunk.record.separator=|| \ # optional if field count and delimiter given
-jobconf map.input.data.chunk.record.separator=||,|| # equivalent to || above
Note:
Permit open-close record delimiters - could permit the same for field
delimiters?
-jobconf map.input.data.chunk.record.separator=begin,end # permit open-close record delimiters
For text files (variable-length records):
-jobconf map.input.data.chunk.record.separator=|| \ # required
For text files (fixed-length records):
-jobconf map.input.data.chunk.record.length=23 \ # required
-jobconf map.input.data.chunk.record.separator=|| \ # optional if record.length given
For text files (fixed-length records):
-jobconf map.input.data.chunk.record.length=23 \ # optional if record delimiter given
-jobconf map.input.data.chunk.record.separator=|| \ # required
Advantages:
- Permits an easy form of normalization that can be _very_ cost (space)
efficient - in some cases by an order of magnitude or more, which is important
when using large archived/read-only data sets.
- Supports very flexible formatting of the index and data input files.
The index file in particular can hold common content, along with the
data-chunk's begin and end positions.
- This InputFormat would, I think, give the following users something close to
a 'neat' solution, or an alternative to consider, and reduce the number of
custom InputFormats being written:
- An alternative to: http://wiki.apache.org/hadoop/FAQ#10
- Among many similar requests:
  - http://www.nabble.com/one-input-file-per-map-tt18220030.html#a18262924
  - http://www.nabble.com/Streaming-%2B-custom-input-format-tt16493126.html#a16493126
  - http://www.nabble.com/on-number-of-input-files-and-split-size-tt16504428.html#a16505831
  - http://www.nabble.com/How-Mappers-function-and-solultion-for-my-input-file-problem--tt18104113.html#a18104113
  - http://www.nabble.com/streaming-%2B-binary-input-output-data--tt16537427.html#a16658687
  - http://www.nabble.com/-HADOOP-users--HowTo-filter-files-for-a-Map-Reduce-task-over-the-same-input-folder-tt16632790.html#a16632790
  - http://www.nabble.com/File-size-and-number-of-files-considerations-tt15953462.html#a15953462
  - http://www.nabble.com/Not-allow-file-split-tt17104944.html#a17106102
  - http://www.nabble.com/Map-Intermediate-key-value-pairs-written-to-file-system-tt16767050.html#a16774182
etc., etc.
Hopefully I have expressed these ideas clearly.
Your thoughts and comments?
Is this worth an enhancement request in Jira?
If so, which category would fit best?
Regards
Mark