[ https://issues.apache.org/jira/browse/HADOOP-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475802 ]
David Bowen commented on HADOOP-941:
------------------------------------

A few notes from trying out this patch.

[1] When I apply the patch to what I think is a clean workspace, I get this:

    ...
    patching file src/test/org/apache/hadoop/record/test/TestWritable.java
    Reversed (or previously applied) patch detected!  Assume -R? [n] n
    Apply anyway? [n]
    Skipping patch.
    1 out of 1 hunk ignored -- saving rejects to file src/test/org/apache/hadoop/record/test/TestWritable.java.rej
    ...

[2] TestRecordWritable needs to be brought up to date because of the change in the Reporter interface. Instead of defining its own do-nothing Reporter, it can simply use "Reporter reporter = Reporter.NULL;".

[3] In the generated code I see sections like this:

    int i1 = org.apache.hadoop.record.Utils.readVInt(b1, s1);
    int i2 = org.apache.hadoop.record.Utils.readVInt(b2, s2);
    int z1 = org.apache.hadoop.record.Utils.getVIntSize(i1);
    int z2 = org.apache.hadoop.record.Utils.getVIntSize(i2);
    s1+=z1; s2+=z2; l1-=z1; l2-=z2;
    int r1 = org.apache.hadoop.record.Utils.compareBytes(b1,s1,l1,b2,s2,l2);
    if (r1 != 0) { return (r1<0)?-1:0; }
    s1+=i1; s2+=i2; l1-=i1; l1-=i2;

I think the compareBytes call should be using i1 and i2, not l1 and l2. This would not show up as a bug in testing, just as a loss of performance. Actually, I think it might have been better to replace this open coding with method calls: i.e. have a packed-buffer class, comprising a byte buffer, a position, and a length remaining, with methods like compareFloat, compareVLong, etc. The generated code would then be much smaller and probably at least as efficient. (Currently the variable length is being computed twice for each int.) A rough sketch of such a helper follows these notes.

[4] I don't see any unit tests for the readVLong/writeVLong combination. Reading the code, I am not convinced that it will work correctly with negative numbers. (It looks like the number read will always be positive.) A sketch of a round-trip test also follows below.

[5] rcc, when called with no arguments, should print a usage message that explains what all the command-line options are.
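To make the suggestion in [3] concrete, here is a rough sketch of what such a packed-buffer helper could look like. The class and method names below (PackedBuffer, compareBuffer) are hypothetical and not part of the current runtime; only the org.apache.hadoop.record.Utils calls are taken from the generated code quoted above. The generated compare method would then reduce to a short sequence of calls such as pb1.compareBuffer(pb2).

    // Hypothetical helper class -- a sketch only; the names are illustrative
    // and not part of the existing org.apache.hadoop.record runtime.
    class PackedBuffer {
      private final byte[] buf;
      private int pos;     // current offset into buf
      private int remain;  // bytes remaining from pos (used by bounds checks
                           // and by the other compare methods)

      PackedBuffer(byte[] buf, int pos, int length) {
        this.buf = buf; this.pos = pos; this.remain = length;
      }

      // Compare the next vint-length-prefixed byte sequence in this buffer
      // with the corresponding sequence in 'other', advancing both buffers.
      int compareBuffer(PackedBuffer other) throws java.io.IOException {
        int len1 = org.apache.hadoop.record.Utils.readVInt(buf, pos);
        int len2 = org.apache.hadoop.record.Utils.readVInt(other.buf, other.pos);
        int z1 = org.apache.hadoop.record.Utils.getVIntSize(len1);
        int z2 = org.apache.hadoop.record.Utils.getVIntSize(len2);
        pos += z1; remain -= z1;
        other.pos += z2; other.remain -= z2;
        // compare only the len1/len2 bytes of this field, not the whole remainder
        int r = org.apache.hadoop.record.Utils.compareBytes(buf, pos, len1,
                                                            other.buf, other.pos, len2);
        pos += len1; remain -= len1;
        other.pos += len2; other.remain -= len2;
        return r;
      }

      // ... similar compareVInt, compareVLong, compareFloat, compareDouble methods
    }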
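And for [4], a minimal round-trip test along the following lines would cover negative values. This is only a sketch: it assumes writeVLong/readVLong are exposed on org.apache.hadoop.record.Utils with DataOutput/DataInput variants, so the calls would need to be adjusted to whatever entry points the runtime actually provides.

    import java.io.*;
    import junit.framework.TestCase;

    // Sketch of a vlong round-trip test ([4]).  The exact
    // Utils.writeVLong/readVLong signatures are assumed here.
    public class TestVLong extends TestCase {
      public void testVLongRoundTrip() throws IOException {
        long[] samples = { 0L, 1L, -1L, 127L, -128L, 1L << 40, -(1L << 40),
                           Long.MAX_VALUE, Long.MIN_VALUE };
        for (int i = 0; i < samples.length; i++) {
          long expected = samples[i];
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          DataOutputStream out = new DataOutputStream(bytes);
          org.apache.hadoop.record.Utils.writeVLong(out, expected);
          DataInputStream in = new DataInputStream(
              new ByteArrayInputStream(bytes.toByteArray()));
          long actual = org.apache.hadoop.record.Utils.readVLong(in);
          assertEquals("vlong round trip failed for " + expected, expected, actual);
        }
      }
    }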
> Make Hadoop Record I/O Easier to use outside Hadoop
> ----------------------------------------------------
>
>                 Key: HADOOP-941
>                 URL: https://issues.apache.org/jira/browse/HADOOP-941
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: record
>    Affects Versions: 0.10.1
>         Environment: All
>            Reporter: Milind Bhandarkar
>         Assigned To: Milind Bhandarkar
>         Attachments: jute-patch.txt
>
>
> Hadoop Record I/O can be used effectively outside of Hadoop. It would increase its utility if developers could use it without having to import Hadoop classes or depend on Hadoop jars. The following changes to the current translator and runtime are proposed.
> Proposed Changes:
> 1. Use java.lang.String as the native type for ustring (instead of Text).
> 2. Provide a Buffer class as the native Java type for buffer (instead of BytesWritable), so that later BytesWritable could be implemented with the following DDL:
>        module org.apache.hadoop.io {
>          record BytesWritable {
>            buffer value;
>          }
>        }
> 3. Member names in generated classes should not have the prefix 'm' before their names. In the above example, the private member name would be 'value', not 'mvalue' as it is now.
> 4. Convert getters and setters to CamelCase, e.g. in the above example the getter would be:
>        public Buffer getValue();
> 5. Provide a 'swiggable' C binding, so that processing the generated C code with swig allows it to be used in scripting languages such as Python and Perl.
> 6. The default --language="java" target would generate class code for records that does not depend on Hadoop's WritableComparable interface, but instead "implements Record, Comparable" (i.e. it would not have write() and readFields() methods). An additional option, "--writable", would need to be specified on the rcc command line to generate classes that "implements Record, WritableComparable".
> 7. Optimize the generated write() and readFields() methods so that they do not have to create a BinaryOutputArchive or BinaryInputArchive every time these methods are called on a record.
> 8. Implement ByteInStream and ByteOutStream for the C++ runtime, as they will be needed for using Hadoop Record I/O with the forthcoming C++ MapReduce framework (currently, only file streams are provided).
> 9. Generate clone() methods for records in Java, i.e. the generated classes should implement Cloneable.
> 10. As part of the Hadoop build process, produce a tar bundle for Record I/O alone. This bundle will contain the translator classes and ant task (lib/rcc.jar), the translator script (bin/rcc), the Java runtime (recordio.jar) including org.apache.hadoop.record.*, sources for the Java runtime (src/java), and the C/C++ runtime sources with Makefiles (src/c++, src/c).
> 11. Make the generated Java code for maps and vectors use Java generics.
> These are the proposed user-visible changes. Internally, the translator will be restructured so that it is easier to plug in translators for different targets.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.