I've been using protocol buffers to serialize the data and then encoding them in base64 so that I can treat them like text. This obviously isn't optimal, but I'm assuming it's only a short-term solution that won't be necessary once non-Java clients become first-class citizens of the Hadoop world.
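For what it's worth, a minimal sketch of that approach (MyRecord is a stand-in for whatever generated protobuf message you actually use, and java.util.Base64 needs Java 8+; on older JVMs commons-codec's Base64 does the same job): serialize to bytes, then base64-encode so each record becomes one newline-free line of text that can be round-tripped.

import java.util.Base64;

import com.google.protobuf.InvalidProtocolBufferException;

// Sketch only: "MyRecord" stands in for your generated protobuf message class.
public class Base64RecordCodec {

    // Serialize one record and base64-encode it so it is a single, newline-free
    // line of text that text-oriented Hadoop formats can carry safely.
    public static String encode(MyRecord record) {
        return Base64.getEncoder().encodeToString(record.toByteArray());
    }

    // Reverse trip: decode the base64 line back into bytes, then parse the
    // protobuf wire format into the typed message.
    public static MyRecord decode(String line) throws InvalidProtocolBufferException {
        return MyRecord.parseFrom(Base64.getDecoder().decode(line.trim()));
    }
}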
Chris

On Mon, Nov 3, 2008 at 2:24 PM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
>
> Protocol buffers, thrift?
>
>
> On 11/3/08 4:07 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:
>
> Zhou, Yunqing wrote:
>> embedded database cannot handle large-scale data, not very efficient
>> I have about 1 billion records.
>> these records should be passed through some modules.
>> I mean a data exchange format similar to XML but more flexible and
>> efficient.
>
>
> JSON
> CSV
> erlang-style records (name,value,value,value)
> RDF-triples in non-XML representations
>
> For all of these, you need to test with data that includes things like
> high unicode characters, single and double quotes, to see how well they
> get handled.
>
> You can actually append with XML by not having opening/closing tags;
> just stream out the entries to the tail of the file:
>
> <entry>...</entry>
>
> To read this in an XML parser, include it inside another XML file:
>
> <?xml version="1.0"?>
> <!DOCTYPE log [
> <!ENTITY log SYSTEM "log.xml">
> ]>
>
> <file>
> &log;
> </file>
>
> I've done this for very big files; as long as you aren't trying to load
> it in-memory into a DOM, things should work.
>
> --
> Steve Loughran            http://www.1060.org/blogxter/publish/5
> Author: Ant in Action     http://antbook.org/
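On the consuming side of Steve's external-entity trick, a rough sketch of streaming the included file with SAX rather than a DOM (the names wrapper.xml and LogEntryStreamer are mine, and newer JDKs with stricter XXE protections may need external general entities explicitly allowed):

import java.io.File;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LogEntryStreamer {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        // wrapper.xml is the small file shown above: it declares the &log; entity
        // pointing at log.xml and includes it inside a single <file> root element.
        parser.parse(new File("wrapper.xml"), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                if ("entry".equals(qName)) {
                    // handle one <entry> at a time; nothing beyond the current
                    // element ever has to be held in memory
                }
            }
        });
    }
}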