I've been using protocol buffers to serialize the data and then
encoding them in base64 so that I can treat them like text.  This
obviously isn't optimal, but I'm assuming it's only a short-term
solution that won't be necessary once non-Java clients become
first-class citizens of the Hadoop world.
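
Roughly what that looks like, in case it's useful to anyone (just a
sketch: "LogRecord" stands in for whatever protoc-generated message
class you use, and I'm using commons-codec for the base64 step):

import org.apache.commons.codec.binary.Base64;

public class ProtoText {

    // protobuf message -> bytes -> base64, safe to treat as one line of text
    public static String encode(LogRecord record) {
        return new String(Base64.encodeBase64(record.toByteArray()));
    }

    // base64 text -> bytes -> protobuf message
    public static LogRecord decode(String line) throws Exception {
        return LogRecord.parseFrom(Base64.decodeBase64(line.getBytes()));
    }
}

The base64 step costs roughly 33% in extra size, which is part of why
I say it isn't optimal.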

Chris

On Mon, Nov 3, 2008 at 2:24 PM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
>
> Protocol buffers, thrift?
>
>
> On 11/3/08 4:07 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:
>
> Zhou, Yunqing wrote:
>> An embedded database cannot handle large-scale data; it's not very
>> efficient. I have about 1 billion records, and these records should
>> be passed through some modules. What I mean is a data exchange
>> format similar to XML but more flexible and efficient.
>
>
> JSON
> CSV
> Erlang-style records (name,value,value,value)
> RDF triples in non-XML representations
>
> For all of these, you need to test with data that includes things like
> high Unicode characters and single and double quotes, to see how well
> they get handled.
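>
> Something concrete to round-trip, whatever format you pick (a made-up
> example covering quotes, angle brackets, a backslash and some
> non-ASCII text):
>
>   String tricky = "O'Brien said \"3 < 5\", caf\u00e9 \u4e2d\u6587, back\\slash";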
>
> You can actually append with XML by not writing an opening/closing
> root tag; just stream the entries out to the tail of the file:
> <entry>...</entry>
>
> To read this in an XML parser, include it inside another XML file:
>
> <?xml version="1.0"?>
> <!DOCTYPE file [
>      <!ENTITY log SYSTEM "log.xml">
> ]>
>
> <file>
> &log;
> </file>
>
> I've done this for very big files; as long as you aren't trying to
> load it into an in-memory DOM, things should work.
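>
> A minimal sketch of reading it that way, assuming the wrapper above is
> saved as "wrapper.xml" and the appended records are <entry> elements
> (external entities have to be enabled, which is the JDK default, so
> the parser pulls in &log; as it streams):
>
> import java.io.File;
> import javax.xml.parsers.SAXParser;
> import javax.xml.parsers.SAXParserFactory;
> import org.xml.sax.Attributes;
> import org.xml.sax.helpers.DefaultHandler;
>
> public class LogReader {
>     public static void main(String[] args) throws Exception {
>         SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
>         parser.parse(new File("wrapper.xml"), new DefaultHandler() {
>             long entries;
>
>             @Override
>             public void startElement(String uri, String local,
>                                      String qName, Attributes atts) {
>                 if ("entry".equals(qName)) {
>                     entries++;   // handle one <entry> at a time here
>                 }
>             }
>
>             @Override
>             public void endDocument() {
>                 System.out.println(entries + " entries read");
>             }
>         });
>     }
> }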
>
> --
> Steve Loughran                  http://www.1060.org/blogxter/publish/5
> Author: Ant in Action           http://antbook.org/
>
>
>
