Hi Abdullah,

I share the concern that adding a null check for the general case would add overhead. My plan is therefore to add the check after the parser but before addTuple, i.e. in FeedRecordDataFlowController. From what I have seen so far, though, FeedRecordDataFlowController does not seem to have any knowledge of the RecordType, so I am still investigating that.
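To make the idea of checking after the parser but before addTuple concrete, here is a minimal Java sketch. It is not AsterixDB code: FieldSpec, conforms, and keepConforming are hypothetical names, and a record is modelled as a plain Object[] rather than serialized bytes. It assumes the checking step can see, for each field, whether it is nullable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are AsterixDB classes.
class NullCheckSketch {

    /** Stand-in for one field of a record type: a name plus a nullable flag. */
    static final class FieldSpec {
        final String name;
        final boolean nullable;

        FieldSpec(String name, boolean nullable) {
            this.name = name;
            this.nullable = nullable;
        }
    }

    /** Returns true when the record has one value per field and no null in a non-nullable field. */
    static boolean conforms(List<FieldSpec> type, Object[] record) {
        if (record.length != type.size()) {
            return false;
        }
        for (int i = 0; i < record.length; i++) {
            if (record[i] == null && !type.get(i).nullable) {
                return false;
            }
        }
        return true;
    }

    /** Stand-in for the controller step: forwards conforming records and silently drops the rest. */
    static List<Object[]> keepConforming(List<FieldSpec> type, List<Object[]> parsed) {
        List<Object[]> accepted = new ArrayList<>();
        for (Object[] record : parsed) {
            if (conforms(type, record)) {
                accepted.add(record); // this is where addTuple would be called
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<FieldSpec> tweetType = Arrays.asList(
                new FieldSpec("id", false),
                new FieldSpec("text", false),
                new FieldSpec("place", true));
        List<Object[]> parsed = Arrays.asList(
                new Object[] {1L, "hello", null},   // kept: null only in a nullable field
                new Object[] {2L, null, "Irvine"}); // dropped: null in a non-nullable field
        System.out.println(keepConforming(tweetType, parsed).size());
    }
}
```

The sketch drops a non-conforming record without throwing, so the parser itself never has to signal the failure.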
I saw the null check in the ADM parser. That is a viable way to handle this within the parser's scope, but I am looking for a slightly different solution. As I see it, the ADM parser assumes that the input ADM conforms to the dataset definition, so it is reasonable for it to throw an exception. For Tweetparser, if I see a null value on a non-nullable attribute, I will discard the whole tweet directly and may not even log it (too many tweets contain nulls). That is why I want to put the check in FeedRecordDataFlowController: I have not found a good way to prevent a record from being inserted from within the parser other than throwing an exception. I am not sure whether this makes sense, so feel free to comment. :)

Best,
Xikui

On Sat, Apr 30, 2016 at 1:52 PM, abdullah alamoudi <bamou...@gmail.com> wrote:

> Adding a few points here:
>
> My feeling is SerializerDeserializer offers another level of abstraction
> but with output I can write value directly without construct AType object.
> I am wondering if there are any preferences over these two?
>
> - Using the SerializerDeserializer option, you will only create a single
> object regardless of the number of parsed records, so I wouldn't worry
> about it. Code maintainability takes precedence here IMO.
> - In addition to records and lists, UTF8StringSerializerDeserializer can be
> stateful for the same reason (avoid creating lots of unneeded objects). In
> fact, our parsers use the stateful UTF8StringSerializerDeserializer since I
> noticed that using the stateless one creates lots of byte[] and triggers GC
> over and over.
> - Right now, we parse missing values as null. Should that change?
> - There is definitely a check for nulls on non-nullable values at least in
> the ADM parser. There might be a bug however that makes it accept explicit
> null values and that should be fixed.
>
> I am for NOT using the cast record solution for the overhead it will add.
> but that is just me :)
> ~Abdullah.
>
>
> On Sat, Apr 30, 2016 at 6:48 AM, Xikui Wang <xik...@uci.edu> wrote:
>
> > Thank you Yingyi. I will try to figure out a solution from that direction.
> >
> > Best,
> > Xikui
> >
> > On Fri, Apr 29, 2016 at 3:48 PM, Yingyi Bu <buyin...@gmail.com> wrote:
> >
> > > Yeah, I think so:-)
> > >
> > > Best,
> > > Yingyi
> > >
> > > On Fri, Apr 29, 2016 at 3:46 PM, Mike Carey <dtab...@gmail.com> wrote:
> > >
> > > > This indeed might be cleaner?
> > > >
> > > >
> > > > On 4/29/16 3:28 PM, Yingyi Bu wrote:
> > > >
> > > >>>> I'm guessing that you can do similar things to CastRecordDescriptor
> > > >>>> if you want to handle general cases in that region.
> > > >>
> > > >> Or, you can inject a cast-record function in the loading pipeline
> > > >> so that you can defer the runtime-type-check/cast to that function
> > > >> instead of doing that in the parser.
> > > >>
> > > >>
> > > >> On Fri, Apr 29, 2016 at 3:25 PM, Yingyi Bu <buyin...@gmail.com> wrote:
> > > >>
> > > >>> My answer is inlined.
> > > >>>
> > > >>>> My feeling is SerializerDeserializer offers another level of abstraction
> > > >>>> but with output I can write value directly without construct AType object.
> > > >>>> I am wondering if there are any preferences over these two?
> > > >>>
> > > >>> I agree with you. However, a SerializerDeserializer has to be stateless,
> > > >>> hence it cannot be used at runtime for complex type objects such as
> > > >>> records and lists, because it will create a lot of Java objects.
> > > >>>
> > > >>>> in other words, parser has to guarantee that the
> > > >>>> processed records has to match the dataset definition(non-optional
> > > >>>> attribute cannot have null value). I tried to assign null value to non-null
> > > >>>> attributes. It will be inserted successfully but read records will have
> > > >>>> problem.
> > > >>>
> > > >>> That sounds right to me. Please file a JIRA issue and assign to you
> > > >>> (if you're working on that).
> > > >>> I'm guessing that you can do similar things to CastRecordDescriptor
> > > >>> if you want to handle general cases in that region.
> > > >>>
> > > >>>> 3. Set to null or skip
> > > >>>> For optional(nullable) attributes, if I want to insert a record with null
> > > >>>> value on that attribute. Should I assign null value or should I just skip
> > > >>>> it? (Probably this is related to the missing attribute that Yingyi
> > > >>>> mentioned today?)
> > > >>>
> > > >>> Assign null value.
> > > >>> Missing means the field doesn't exist in a record at all.
> > > >>>
> > > >>> Best,
> > > >>> Yingyi
> > > >>>
> > > >>>
> > > >>> On Fri, Apr 29, 2016 at 2:06 PM, Xikui Wang <xik...@uci.edu> wrote:
> > > >>>
> > > >>>> Hi devs,
> > > >>>>
> > > >>>> I came across several questions while I was constructing records in
> > > >>>> AsterixDB. Hope someone can help me clear the confusion. :)
> > > >>>>
> > > >>>> 1. Write directly to data output or use SerializerDeserializer
> > > >>>> I am working with AbstractDataParser now. I see people using different ways
> > > >>>> to append attributes to data output. Either use:
> > > >>>> output.write(typetag.serialize());
> > > >>>> output.writeInt(0);
> > > >>>> to write into data output directly, or
> > > >>>> use AInt8SerializerDeserializer.serialize(int8Serde) to serialize an AINT8
> > > >>>> instance to output. *SerializerDeserializer uses writeByte to write output.
> > > >>>>
> > > >>>> My feeling is SerializerDeserializer offers another level of abstraction
> > > >>>> but with output I can write value directly without construct AType object.
> > > >>>> I am wondering if there are any preferences over these two?
> > > >>>>
> > > >>>> 2. RecordType validation after parser but before add to frame?
> > > >>>> My observation is after parser finish writing the output and pass it to
> > > >>>> next level, there is no such validation that checks whether non-optional
> > > >>>> field is null or not. In other words, parser has to guarantee that the
> > > >>>> processed records has to match the dataset definition(non-optional
> > > >>>> attribute cannot have null value). I tried to assign null value to non-null
> > > >>>> attributes. It will be inserted successfully but read records will have
> > > >>>> problem.
> > > >>>>
> > > >>>> 3. Set to null or skip
> > > >>>> For optional(nullable) attributes, if I want to insert a record with null
> > > >>>> value on that attribute. Should I assign null value or should I just skip
> > > >>>> it? (Probably this is related to the missing attribute that Yingyi
> > > >>>> mentioned today?)
> > > >>>>
> > > >>>> Thanks for your help.
> > > >>>>
> > > >>>> Best,
> > > >>>> Xikui
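
For readers following question 1 of the original message (writing to the data output directly versus going through a serializer object), the two styles can be compared in a small Java sketch. Everything here is an assumption made for illustration: the tag value, BoxedInt, Serde, and INT_SERDE are hypothetical and are not the AsterixDB classes named in the thread. Only java.io.DataOutput and its writeByte and writeInt methods are real library calls.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: the tag value, BoxedInt, and Serde are illustrative only.
class WriteStyleSketch {
    static final byte INT32_TAG = 4; // assumed tag byte, not taken from AsterixDB

    /** Stand-in for a typed value object that a serializer would consume. */
    static final class BoxedInt {
        final int value;

        BoxedInt(int value) {
            this.value = value;
        }
    }

    /** Stand-in for a serializer interface that writes an instance to a DataOutput. */
    interface Serde<T> {
        void serialize(T instance, DataOutput out) throws IOException;
    }

    /** A single shared serializer instance, reused for every value. */
    static final Serde<BoxedInt> INT_SERDE = (instance, out) -> out.writeInt(instance.value);

    /** Style 1: write the tag and the payload straight to the output. */
    static byte[] writeDirect(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutput out = new DataOutputStream(bytes);
        try {
            out.writeByte(INT32_TAG);
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Style 2: wrap the value in an object and let the serializer write the payload. */
    static byte[] writeViaSerde(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutput out = new DataOutputStream(bytes);
        try {
            out.writeByte(INT32_TAG);
            INT_SERDE.serialize(new BoxedInt(value), out); // allocates one wrapper per value
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.equals(writeDirect(7), writeViaSerde(7)));
    }
}
```

In this sketch both styles produce the same bytes. The difference is the extra wrapper object in the second style, which is the allocation cost the thread weighs against maintainability.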