Hi Ashutosh,

Actually, the data is currently in Hive (text with LazySimpleSerDe), and I have
developed a version of PigStorage (similar to PigPerformanceLoader) to load
the text data in Hive. To make this work with Hive, I am copying some
of the serde parameters into the HCat params. Also, because the parsing logic
is currently copied into Pig, it doesn't directly support escapeChar,
lastColumnTakeRest, etc.
Also, with this approach, I will have to write a separate loadfunc for
SequenceFile, and neither will work with LazyBinarySerDe etc.
I am trying to understand whether there is an elegant design that solves this
problem. I would appreciate it if you could send me pointers for such an
approach.
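
For concreteness, the multi-delimiter parsing that is currently hard-coded
in the loadfunc looks roughly like the sketch below. It is only an
illustration under assumptions: the class HiveTextRowSketch and its methods
are hypothetical names (not part of Pig, Hive, or HCatalog), the delimiters
are Hive's default Ctrl-A/Ctrl-B/Ctrl-C, and backslash is assumed as the
escape character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not part of Pig, Hive, or HCatalog. */
public class HiveTextRowSketch {
    // Hive's default text delimiters: Ctrl-A, Ctrl-B, Ctrl-C.
    public static final char FIELD = '\u0001';
    public static final char COLLECTION = '\u0002';
    public static final char MAP_KEY = '\u0003';

    /** Splits s on delim; escaped pairs are kept intact for inner levels. */
    public static List<String> split(String s, char delim, char escape) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == escape && i + 1 < s.length()) {
                cur.append(c).append(s.charAt(++i)); // copy escape + next char
            } else if (c == delim) {
                parts.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        parts.add(cur.toString());
        return parts;
    }

    /** Removes escape characters from a leaf value. */
    public static String unescape(String s, char escape) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == escape && i + 1 < s.length()) {
                out.append(s.charAt(++i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Parses one field as an array of strings. */
    public static List<String> parseArray(String field, char escape) {
        List<String> out = new ArrayList<>();
        for (String item : split(field, COLLECTION, escape)) {
            out.add(unescape(item, escape));
        }
        return out;
    }

    /** Parses one field as a map of string keys to string values. */
    public static Map<String, String> parseMap(String field, char escape) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String entry : split(field, COLLECTION, escape)) {
            List<String> kv = split(entry, MAP_KEY, escape);
            String value = kv.size() > 1 ? unescape(kv.get(1), escape) : null;
            out.put(unescape(kv.get(0), escape), value);
        }
        return out;
    }

    public static void main(String[] args) {
        // One row: an int-like field, an array field, and a map field.
        String line = "42" + FIELD + "a" + COLLECTION + "b" + FIELD
                + "k1" + MAP_KEY + "v1" + COLLECTION + "k2" + MAP_KEY + "v2";
        List<String> fields = split(line, FIELD, '\\');
        System.out.println(unescape(fields.get(0), '\\'));
        System.out.println(parseArray(fields.get(1), '\\'));
        System.out.println(parseMap(fields.get(2), '\\'));
    }
}
```

Every delimiter and the escape handling are baked into the loader here,
which is why each new storage format or serde needs its own copy of this
logic.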

Thanks,
Aniket

On Thu, Nov 10, 2011 at 11:26 AM, Ashutosh Chauhan <[email protected]> wrote:

> Hey Aniket,
>
> I am assuming you already have a Pig loadfunc that you want to use with
> HCatalog. Daniel's work on
> https://issues.apache.org/jira/browse/HCATALOG-121 has made this
> super easy. You can follow the test case in that patch to see how to make
> your custom loadfunc work with HCatalog. Essentially, you can specify the
> top-level field delimiter as a table property. Parsing of individual fields
> is done through the loadfunc's LoadCaster interface, where you can
> plug in your parsing logic, which can make use of the delimiters within
> fields for complex types.
>
> Hope it helps,
> Ashutosh
>
> On Wed, Nov 9, 2011 at 16:26, Aniket Mokashi <[email protected]> wrote:
>
> > Hi,
> >
> > I have been playing with loading SequenceFile- and text-based tables
> > from Hive using Pig for the last few days. I am wondering what would be
> > the best way to proceed with this. Please share any pointers to design
> > ideas and where to look when developing this.
> >
> > Hive stores text data with multiple delimiters for fields, collections, and
> > maps. I tried using LoadFuncBasedInputDriver to support multi-delimiter
> > text loading. For this, I am passing these delimiters down as arguments to
> > the loadfunc. Also, the parsing code is inside my loadfunc method, and I am
> > tied to one serde for doing this. This is not an elegant approach. I am
> > thinking of delegating this task to the serde and constructing the LazyStruct
> > from it (I am not sure whether that will still keep it generic).
> >
> > Any ideas how I should proceed with this?
> >
> > Thanks,
> > Aniket
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"
