Re: Questions of building record in AsterixDB

Xikui Wang Sat, 30 Apr 2016 15:27:00 -0700

Hi Abdullah,

Actually, I share the concern that adding a null check for the general case
will bring extra overhead. So I plan to add the check after the parser but
before addTuple, i.e. in FeedRecordDataFlowController. But from what I have
seen so far, the RecordType does not seem to be visible to
FeedRecordDataFlowController, so I am still investigating that...


I saw the null check in the ADM parser. That is a viable way to handle it
within the parser scope, but I am looking for a slightly different solution.
From my perspective, the ADM parser assumes that the input ADM conforms to
the dataset definition, so it is reasonable for it to throw an exception.
For Tweetparser, if I see a null value on a non-null attribute, I will
discard the whole tweet directly and may not even log it (as there are too
many tweets with nulls). That is why I want to put the check in
FeedRecordDataFlowController: I did not see a good way to prevent a record
from being inserted from within the parser, other than throwing an
exception.
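
A minimal sketch of the controller-side check described above, assuming a
record is just a map from field name to value. NullFilterSketch, conforms,
and keepConforming are hypothetical names for illustration, not AsterixDB
classes or methods, and the real FeedRecordDataFlowController works on
serialized tuples rather than maps:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names are AsterixDB classes.
class NullFilterSketch {

    // True when every non-nullable field is present with a non-null value.
    // Map.get returns null for both a missing key and an explicit null,
    // so this sketch rejects both cases.
    static boolean conforms(Map<String, Object> record, Set<String> nonNullableFields) {
        for (String field : nonNullableFields) {
            if (record.get(field) == null) {
                return false;
            }
        }
        return true;
    }

    // Stands in for the step between the parser and addTuple:
    // non-conforming records are dropped silently instead of throwing.
    static List<Map<String, Object>> keepConforming(
            List<Map<String, Object>> parsed, Set<String> nonNullableFields) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> record : parsed) {
            if (conforms(record, nonNullableFields)) {
                kept.add(record); // the real controller would call addTuple here
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Object> good = new HashMap<>();
        good.put("id", 1L);
        good.put("text", "hello");

        Map<String, Object> bad = new HashMap<>();
        bad.put("id", 2L);
        bad.put("text", null); // null on a non-nullable attribute

        List<Map<String, Object>> kept =
                keepConforming(List.of(good, bad), Set.of("id", "text"));
        System.out.println("kept " + kept.size() + " of 2 records");
    }
}
```

With id and text declared non-nullable, the tweet whose text is null is
dropped without an exception, and the feed continues with the next record.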

Not sure whether this makes sense. Feel free to comment. :)

Best,
Xikui

On Sat, Apr 30, 2016 at 1:52 PM, abdullah alamoudi <[email protected]>
wrote:

> Adding a few points here:
>
> My feeling is SerializerDeserializer offers another level of abstraction
> but with output I can write value directly without construct AType object.
> I am wondering if there are any preferences over these two?
>
> - Using the SerializerDeserializer option, you will only create a single
> object regardless of the number of parsed records, so I wouldn't worry
> about it. Code maintainability takes precedence here IMO.
> - In addition to records and lists, UTF8StringSerializerDeserializer can be
> stateful for the same reason (to avoid creating lots of unneeded objects).
> In fact, our parsers use the stateful UTF8StringSerializerDeserializer since
> I noticed that using the stateless one creates lots of byte[] and triggers
> GC over and over.
> - Right now, we parse missing values as null. Should that change?
> - There is definitely a check for nulls on non-nullable values at least in
> the ADM parser. There might be a bug however that makes it accept explicit
> null values and that should be fixed.
>
> I am for NOT using the cast record solution because of the overhead it will
> add, but that is just me :)
> ~Abdullah.
>
>
> On Sat, Apr 30, 2016 at 6:48 AM, Xikui Wang <[email protected]> wrote:
>
> > Thank you Yingyi. I will try to figure out a solution from that
> direction.
> >
> > Best,
> > Xikui
> >
> > On Fri, Apr 29, 2016 at 3:48 PM, Yingyi Bu <[email protected]> wrote:
> >
> > > Yeah, I think so:-)
> > >
> > > Best,
> > > Yingyi
> > >
> > > On Fri, Apr 29, 2016 at 3:46 PM, Mike Carey <[email protected]> wrote:
> > >
> > > > This indeed might be cleaner?
> > > >
> > > >
> > > > On 4/29/16 3:28 PM, Yingyi Bu wrote:
> > > >
> > > >> I'm guessing that you can do similar things to CastRecordDescriptor
> > > >>>> if you want to handle general cases in that region.
> > > >>>>
> > > >>> Or, you can inject a cast-record function in the loading pipeline
> > > >> so that you can defer the runtime-type-check/cast to that function
> > > instead
> > > >> of doing that in the parser.
> > > >>
> > > >>
> > > >> On Fri, Apr 29, 2016 at 3:25 PM, Yingyi Bu <[email protected]>
> > wrote:
> > > >>
> > > >> My answer is inlined.
> > > >>>
> > > >>> My feeling is SerializerDeserializer offers another level of
> > > abstraction
> > > >>>>> but with output I can write value directly without construct
> AType
> > > >>>>>
> > > >>>> object.
> > > >>>
> > > >>>> I am wondering if there are any preferences over these two?
> > > >>>>>
> > > >>>> I agree with you. However, a SerializerDeserializer has to be
> > > stateless,
> > > >>> hence it cannot be used at runtime for complex type objects such as
> > > >>> records and lists,
> > > >>> because it will create a lot Java objects.
> > > >>>
> > > >>> in other words, parser has to guarantee that the
> > > >>>>> processed records has to match the dataset
> definition(non-optional
> > > >>>>> attribute cannot have null value). I tried to assign null value
> to
> > > >>>>>
> > > >>>> non-null
> > > >>>
> > > >>>> attributes. It will be inserted successfully but read records will
> > > have
> > > >>>>> problem.
> > > >>>>>
> > > >>>> That sounds right to me.  Please file a JIRA issue and assign to
> > you (
> > > >>> if you're working on that).
> > > >>> I'm guessing that you can do similar things to CastRecordDescriptor
> > > >>> if you want to handle general cases in that region.
> > > >>>
> > > >>> 3. Set to null or skip
> > > >>>>> For optional(nullable) attributes, if I want to insert a record
> > with
> > > >>>>>
> > > >>>> null
> > > >>>
> > > >>>> value on that attribute. Should I assign null value or should I
> just
> > > >>>>>
> > > >>>> skip
> > > >>>
> > > >>>> it? (Probably this is related to the missing attribute that Yingyi
> > > >>>>> mentioned today?)
> > > >>>>>
> > > >>>> Assign null value.
> > > >>> Missing means the field doesn't exist in a record at all.
> > > >>>
> > > >>> Best,
> > > >>> Yingyi
> > > >>>
> > > >>>
> > > >>> On Fri, Apr 29, 2016 at 2:06 PM, Xikui Wang <[email protected]>
> wrote:
> > > >>>
> > > >>> Hi devs,
> > > >>>>
> > > >>>> I came across several questions while I was constructing records
> in
> > > >>>> AsterixDB.  Hope someone can help me clear the confusion. :)
> > > >>>>
> > > >>>> 1. Write directly to data output or use SerializerDeserializer
> > > >>>> I am working with AbstractDataParser now. I see people using
> > different
> > > >>>> ways
> > > >>>> to append attributes to data output. Either use:
> > > >>>> output.Write(typetag.serialize());
> > > >>>> output.WriteInt(0);
> > > >>>> to write into data output directly, or
> > > >>>> use AInt8SerializerDeserializer.serialize(int8Serde) to serialize
> a
> > > >>>> AINT8
> > > >>>> instance to output. *SerializerDeserializer uses writeByte to
> write
> > > >>>> output.
> > > >>>>
> > > >>>> My feeling is SerializerDeserializer offers another level of
> > > abstraction
> > > >>>> but with output I can write value directly without construct AType
> > > >>>> object.
> > > >>>> I am wondering if there are any preferences over these two?
> > > >>>>
> > > >>>> 2. RecordType validation after parser but before add to frame?
> > > >>>> My observation is after parser finish writing the output and pass
> it
> > > to
> > > >>>> next level, there is no such validation that checks whether
> > > non-optional
> > > >>>> field is null or not. In other words, parser has to guarantee that
> > the
> > > >>>> processed records has to match the dataset definition(non-optional
> > > >>>> attribute cannot have null value). I tried to assign null value to
> > > >>>> non-null
> > > >>>> attributes. It will be inserted successfully but read records will
> > > have
> > > >>>> problem.
> > > >>>>
> > > >>>> 3. Set to null or skip
> > > >>>> For optional(nullable) attributes, if I want to insert a record
> with
> > > >>>> null
> > > >>>> value on that attribute. Should I assign null value or should I
> just
> > > >>>> skip
> > > >>>> it? (Probably this is related to the missing attribute that Yingyi
> > > >>>> mentioned today?)
> > > >>>>
> > > >>>> Thanks for your help.
> > > >>>>
> > > >>>> Best,
> > > >>>> Xikui
> > > >>>>
> > > >>>>
> > > >>>
> > > >
> > >
> >
>
