Re: Review Request: SchemaTuple in Pig

Jonathan Coveney Fri, 06 Apr 2012 12:50:57 -0700


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
> >  line 188
> > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188>
> >
> >     I would make it:
> >     TupleFactory.geInstanceForSchema().newTuple()
> 
> Jonathan Coveney wrote:
>     I think it's worth thinking about how we want to do this. Dmitriy 
> implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I 
> use that for this... however I agree, I think there should be a 
> "SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At 
> the same time, that might be a lot of scaffolding for not a lot of gain: once 
> you generate the code for a given SchemaTuple, there's not a lot of work to 
> be done (though this could encapsulate some of the static maps that come 
> later). One other factor is that the TupleFactory interface doesn't really 
> lend itself very nicely to the SchemaTuple, but I do think it could be 
> extended and be made to work. Would appreciate more of your thoughts on this.
> 
> Julien Le Dem wrote:
>     I meant:
>     TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()
>     
>     The change is relatively small, it is just moving this from one class to 
> the other and getting rid of the Maps. PrimitiveTuple should follow the same 
> logic.
>     
>     Asking for the schema in the factory initialization will simplify the 
> code. As you say, you won't need to constantly look up the schema. Also pig 
> is about writing a lot of tuples of the same type. 
>     
>     The block of code (see 2 comments above) that triggers the code 
> generation would just get the factories instead of actually generating 
> Tuples. This is caused by the api asking for the schema in the wrong place.
>     
>     TupleFactory is an abstract class so it should be doable to add methods 
> without breaking compatibility.
>     
>     If needed, backard compatibility can be maintained by having 
> newTupleForSchema(inputSchemaForGen) call 
> TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()
>


I'd like your thoughts fleshing out the TupleFactory interface for a 
SchemaTupleFactory (which imho at this point makes the most sense -- 
TupleFactory.getInstanceForSchema(inputSchema); returns a SchemaTupleFactory 
which extends TupleFactory).

newTuple();
easy

newTuple(int size);
meaningless, throw an error?

newTuple(List c);
uses set(List object), throws error otherwise?

newTuple(Object datum);
meaningless?

newTupleNoCopy(List list);
same as newTuple(list);

If that sounds reasonable, I like this approach


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70>
> >
> >     largestSetValue should already be correct. Why does largestSetValue 
> > stays at the size of the schema?
> 
> Jonathan Coveney wrote:
>     I guess once it is at the value of the SchemaTuple (excluding appends), 
> there's no need to update it.
> 
> Julien Le Dem wrote:
>     Correct but know you have to check this every time you update. I was 
> thinking it would simplify the code a little if it was just the actual size 
> of the tuple (including past the original size).

Well, I guess that depends how we define the size of the tuple. To me, the size 
= number of fields (independently of whether they are set) + number of append 
fields. The reason we need this field, then, is to keep track of whether or not 
we've set one of the generated fields. We could redefine the size to be the 
largest non-null field, but that's going to make it slower for minimal gain. 
Thoughts?


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130>
> >
> >     could be computed statically by the generated class if you make a 
> > static getSchema(schemaString).
> 
> Jonathan Coveney wrote:
>     Well, my thought is that a static cache wouldn't take up much memory, and 
> it would avoid having to recreate the Schema object every time getSchema() is 
> called. I don't forsee it being called a ton, but possibly enough that 
> recreating the object a bunch would be wasteful. Perhaps this is premature 
> optimization?
> 
> Julien Le Dem wrote:
>     caching is fine.
>     I meant that you could have the Schema object "cached" in a static field 
> of the generated class
>     instead of a runtime generation + cache
>     
>     in the generated class:
>     
>     static schemaString = "....";
>     static Schema schema = Utils.getSchemaFromString(schemaString);
>     
>     public Schema getSchema() {
>       return schema;
>     }
>     
>

Ah, very good call


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159>
> >
> >     do we need all of set(Tuple), set(List), setEqualTo(List), 
> > setEqualTo(Tuple) ?
> 
> Jonathan Coveney wrote:
>     My working version adds even more. I think a lot of it should be made 
> protected, or at least, I should be more thoughtful about what it should be. 
> I think "set" is probably enough, but perhaps it should just be the void 
> version? I guess this is where working with ruby where everything generally 
> returns "this" by default, thus allowing for nice chaining of methods, is the 
> norm.
> 
> Julien Le Dem wrote:
>     I'm fine with the chaining mechanism.
>     We should try to avoid having both as it makes the code harder to read. 
> You cant set and ignore the return value;

good call


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310>
> >
> >     refactor DataByteArray to provide this?
> 
> Jonathan Coveney wrote:
>     So here is the thing about this. Currently, a ton of the useful methods 
> in BinInterSedes (where this logic was taken from) is private. Also, they 
> aren't static (though there's no real reason why they shouldn't be?) This 
> snippet (and other logic like it) should probably be made into a protected 
> static method of BinInterSedes. Thoughts? It could also make sense for 
> classes to have a static method to serialize themselves, but I'd argue that 
> the BinInterSedes approach is probably more consistent with how pig is laid 
> out (though I have no idea why most of the methods there are private instead 
> of protected static).
> 
> Julien Le Dem wrote:
>     private stuff can be safely moved around and refactored :) (from a 
> compatibility point of view)
>     Let's think of a way to have the same code used in all cases. There used 
> to be one type of tuple so what made sense before may need to changee.
>     
>     - parent class for both ?
>     - TupleSerializer class ?
>     - static helpers?

per your suggestion and dmitriy's go ahead on the list serv, I'm moving a bit 
of the shared logic to a SedesHelper class which both BinInterSedes and 
SchemaTuple will use


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325>
> >
> >     Same comment, the String serialization should be shared with the 
> > regular Tuple.
> >     
> >     Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage 
> > and convert lazilyto String (possibly caching the result)
> 
> Jonathan Coveney wrote:
>     This logic is stolen directly from BinInterSedes (see comment above). Can 
> you flesh out what you mean about UTF8? If we change it here we should 
> probably change it there as well (and make that the canonical protected 
> static source of all your string serialization needs).
> 
> Julien Le Dem wrote:
>     Following your goal of reducing memory utilization and as the serialized 
> format is UTF8, we could keep UTF8 as the memory representation instead of 
> String.
>     Java Strings are UTF16 which is minimum 2 bytes per characters. If the 
> data is mostly ascii, this is wasteful. 
>     To maintain backward compatibility the get(int) would lazily convert to 
> String on demand, possibly caching the result in a second field to save 
> processing, but this may be overkill.

What is the latency on utf8 -> utf16 conversion? That's all I'd be worried 
about but your suggestion is a good one.


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452>
> >
> >     is (null, null, null) the same as null ?
> 
> Jonathan Coveney wrote:
>     In DefaultTuple, this is deprecated and not really used. Given that, I 
> use this to basically mean "has any information been written to this." So in 
> your case, both would be null. The more important question is to make sure I 
> use it consistently throughout. Honestly, the whole null thing is a pain, and 
> I need to really comb over things to make sure I incorporated it properly.
> 
> Julien Le Dem wrote:
>     agreed, we should just make sure this is the same behavior as the 
> DefaultTuple

DefaultTuple doesn't implement isNull(). It's just a dummy null implementation. 
But the difference is that I have fields that can be "null" but aren't actually 
null (ie primitives), so it's a bit more useful.


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474>
> >
> >     this.set(t) ?
> 
> Jonathan Coveney wrote:
>     reference is this weird Franken-method that isn't used anywhere in the 
> codebase. I don't know that I want to implement it when the semantics of what 
> it should do don't seem clear at all. Open to thoughts on this though. The 
> original intent of it doesn't make much sense for a SchemaTuple since the 
> underlying structures are primitives, not objects.
> 
> Julien Le Dem wrote:
>     "the underlying structures are primitives, not objects"
>     for now, as Bags and Maps will come.
>     For backward compatibility, I think we should implement this as a set(). 
> It seems this is intended as a cheap set, not sure if "modifying one modifies 
> the others" behavior is expected. some UDFs could depend on this.
>     
>     Also let's mark reference() as deprecated right now so that we can remove 
> it later.

 I have no problem making this a set. The nice part about doing that is it 
means you can essentially get the functionality of a set without having to cast 
it. The downside being the weird semantics around it. Probably a net win. Agree 
on deprecating reference


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698>
> >
> >     this file has 1500 lines :)
> >     I think those classes have earned their own package.
> 
> Jonathan Coveney wrote:
>     Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is 
> why I don't like the idea of splitting it out. Yeah, it makes the file 
> larger, but it makes 0 sense outside of the context of what is going on here. 
> Perhaps that should be rectified?
> 
> Julien Le Dem wrote:
>     I don't feel strongly about it.
>     Maybe you want to separate the base class SchemaTuple (runtime) from the 
> Generator ?

Agreed, very good suggestion


> On 2012-04-06 03:30:08, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138
> > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138>
> >
> >     CheckIfNullString.masks
> >     maybe they should be in a Masks class then, along with the masking code 
> > generation (" & (byte)+...)
> 
> Jonathan Coveney wrote:
>     I don't know how worthwhile it is to split all of this stuff out into 
> classes. Maybe it is. A lot of stuff is used once or twice in disparate 
> places, and I fear that splitting it out more and more will just make people 
> jump through 50 classes to understand what a line of code gen is doing. On 
> the other hand, it would make fixing bugs much cleaner. What probably needs 
> to be done is some sort of cleaner code generation framework... would 
> appreciate more thoughts here.
> 
> Julien Le Dem wrote:
>     My goal is to have the bit management thing implemented once so that it 
> is easier to change/improve/bugfix.
>     agreed that we should keep things simple.
>     If we have a SchemaTupleClassGenerator  that contains bit management 
> helpers that could simplify things.
>     Separating the runtime from the generation would reduce the amount of 
> things you have to look at at the same time.

+1


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6722
-----------------------------------------------------------


On 2012-04-06 17:20:20, Jonathan Coveney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4651/
> -----------------------------------------------------------
> 
> (Updated 2012-04-06 17:20:20)
> 
> 
> Review request for pig and Julien Le Dem.
> 
> 
> Summary
> -------
> 
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> 
> Need to clean up the code and add tests.
> 
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.
> 
> Needs tests and comments, but I want the code to settle a bit.
> 
> 
> This addresses bug PIG-2632.
>     https://issues.apache.org/jira/browse/PIG-2632
> 
> 
> Diffs
> -----
> 
>   trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 
>   
> trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
>  1309628 
>   trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 
>   trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 
>   trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
>   trunk/src/org/apache/pig/data/TupleFactory.java 1309628 
>   trunk/src/org/apache/pig/impl/PigContext.java 1309628 
>   trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 
> 1309628 
> 
> Diff: https://reviews.apache.org/r/4651/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jonathan
> 
>

Re: Review Request: SchemaTuple in Pig

Reply via email to