Hi Devs, I have two related questions:

1. Is there any example code for using a UDF in a feed adapter?
2. Can we use AQL functions in those feed UDFs?
Thank you.

On Tue, Oct 27, 2015 at 9:54 PM, Michael Carey <[email protected]> wrote:
> Thanks!
>
> On 10/27/15 9:48 AM, Raman Grover wrote:
>
>> Hi,
>>
>> When data is being received from an external source (e.g. during feed
>> ingestion), a slow rate of arrival may result in excessive delays until the
>> data is deposited into the target dataset and made accessible to queries.
>> Data moves along a data ingestion pipeline between operators as packed,
>> fixed-size frames. The default behavior is to wait for a frame to be full
>> before dispatching the contained data to the downstream operator. However,
>> this may not suit all scenarios, particularly when the data source sends
>> data at a low rate. To cater to different scenarios, AsterixDB allows
>> configuring this behavior. The different options are described next.
>>
>> Push data downstream when:
>> (a) the frame is full (default);
>> (b) at least N records (data items) have been collected into a partially
>> filled frame;
>> (c) at least T seconds have elapsed since the last record was put into
>> the frame.
>>
>> How to configure the behavior?
>> When defining a feed, an end user may specify configuration parameters
>> that determine the runtime behavior (option (a), (b), or (c) above).
>>
>> The parameters are described below:
>>
>> "parser-policy": a specific strategy chosen from a set of pre-defined
>> values:
>>
>> (i) "frame_full"
>> This is the default value. As the name suggests, this choice causes frames
>> to be pushed by the feed adapter only when there isn't sufficient space
>> for an additional record to fit in. This corresponds to option (a).
>>
>> (ii) "counter_timer_expired"
>> Use this value if you wish to set option (b), option (c), or a combination
>> of both.
>>
>> Some examples:
>>
>> 1) Pack a maximum of 100 records into a data frame and push it downstream.
>> create feed my_feed using my_adaptor
>> (("parser-policy"="counter_timer_expired"), ("batch-size"="100"), ...
>> other parameters);
>>
>> 2) Wait up to 2 seconds and send however many records have been collected
>> in the frame downstream.
>>
>> create feed my_feed using my_adaptor
>> (("parser-policy"="counter_timer_expired"), ("batch-interval"="2"), ...
>> other parameters);
>>
>> 3) Wait until 100 records have been collected into a data frame, or until
>> 2 seconds have elapsed since the last record was put into the current data
>> frame.
>>
>> create feed my_feed using my_adaptor
>> (("parser-policy"="counter_timer_expired"), ("batch-interval"="2"),
>> ("batch-size"="100"), ... other parameters);
>>
>> Note: the above config parameters are not specific to a particular adapter
>> implementation; they are available for use with any feed adapter. Some
>> adapters that ship with AsterixDB use different default values to suit
>> their specific scenario. E.g., the pull-based Twitter adapter uses
>> "counter_timer_expired" as the "parser-policy" and sets the parameter
>> "batch-interval".
>>
>> Regards,
>> Raman
>>
>> PS: The names of the parameters described above are not as intuitive as
>> one would like them to be. The names need to be changed.
>>
>> On Thu, Oct 22, 2015 at 9:09 AM, Mike Carey <[email protected]> wrote:
>>
>> I think we need to have tuning parameters - like batch size and maximum
>> tolerable latency (in case there's a lull and you still want to push
>> stuff with some worst-case delay). @Raman Grover - remind me (us) what's
>> available in this regard?
>>
>> On 10/22/15 4:29 AM, Pääkkönen Pekka wrote:
>>
>>> Hi,
>>>
>>> Yes, you are right. I tried sending a larger amount of data, and data is
>>> now stored to the database.
>>>
>>> Does it make sense to configure a smaller batch size in order to get
>>> more frequent writes? Or would it significantly impact performance?
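The counter/timer flush policy described above (push a partially filled frame once N records have been collected or T seconds have elapsed since the last record) can be sketched as a toy model. This is purely illustrative and not AsterixDB's actual implementation; the class and method names are made up, and the injectable clock exists only to make the timer behavior easy to demonstrate:

```python
import time


class FrameBatcher:
    """Toy model of the 'counter_timer_expired' parser-policy:
    flush when batch_size records are buffered, or when batch_interval
    seconds have passed since the last record arrived. Illustrative
    sketch only; names and structure are hypothetical."""

    def __init__(self, batch_size=100, batch_interval=2.0, clock=time.monotonic):
        self.batch_size = batch_size
        self.batch_interval = batch_interval
        self.clock = clock          # injectable for testing
        self.frame = []
        self.last_record_at = None

    def add(self, record):
        """Buffer a record; return a flushed frame if the size threshold fired."""
        self.frame.append(record)
        self.last_record_at = self.clock()
        if len(self.frame) >= self.batch_size:
            return self.flush()
        return None

    def poll(self):
        """Called periodically; flush if the time threshold fired."""
        if self.frame and self.clock() - self.last_record_at >= self.batch_interval:
            return self.flush()
        return None

    def flush(self):
        frame, self.frame = self.frame, []
        return frame
```

With `batch_size=3`, the third `add` returns the frame immediately (option (b)); with a large `batch_size`, a single buffered record is pushed out by `poll` once the interval elapses (option (c)). A smaller batch size, as Pekka asks, trades some per-frame overhead for lower latency.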
>>> -Pekka

>>> Data moves through the pipeline in frame-sized batches, so one
>>> (uninformed :-)) guess is that you aren't running very long, and you're
>>> only seeing the data flow when you close because only then do you have a
>>> batch's worth. Is that possible? You can test this by running longer
>>> (more data) and seeing if you start to see the expected incremental
>>> flow/inserts. (And we need tunability in this area, e.g., parameters on
>>> how much batching and/or how much latency to tolerate on each feed.)
>>>
>>> On 10/21/15 4:45 AM, Pääkkönen Pekka wrote:
>>>
>>> > Hi,
>>> >
>>> > Thanks, now I am able to create a socket feed and save items to the
>>> > dataset from the feed.
>>> >
>>> > It seems that data items are written to the dataset only after I close
>>> > the socket at the client. Is there some way to indicate to an AsterixDB
>>> > feed (with a newline or other indicator) that data can be written to
>>> > the database while the connection is open?
>>> >
>>> > After I close the socket at the client, the feed seems to close down.
>>> > Or is it only paused until it is resumed?
>>> >
>>> > -Pekka
>>> >
>>> > Hi Pekka,
>>> >
>>> > That's interesting; I'm not sure why the CC would appear as being down
>>> > to Managix. However, if you can access the web console, that evidently
>>> > isn't the case.
>>> >
>>> > As for data ingestion via sockets, yes, it is possible, but it kind of
>>> > depends on what's meant by sockets.
>>> > There's no tutorial for it, but take a look at SocketBasedFeedAdapter
>>> > in the source, as well as
>>> > https://github.com/kisskys/incubator-asterixdb/blob/kisskys/indexonlyhilbertbtree/asterix-experiments/src/main/java/org/apache/asterix/experiment/client/SocketTweetGenerator.java
>>> > for some examples of how it works.
>>> >
>>> > Hope that helps!
>>> >
>>> > Thanks,
>>> > -Ian
>>> >
>>> > On Mon, Oct 19, 2015 at 10:15 PM, Pääkkönen Pekka
>>> > <[email protected]> wrote:
>>> > > Hi Ian,
>>> > >
>>> > > Thanks for the reply. I compiled AsterixDB v0.8.7 and started it.
>>> > >
>>> > > However, I get the following warnings:
>>> > >
>>> > > INFO: Name:my_asterix
>>> > > Created:Mon Oct 19 08:37:16 UTC 2015
>>> > > Web-Url:http://192.168.101.144:19001
>>> > > State:UNUSABLE
>>> > >
>>> > > WARNING!:Cluster Controller not running at master
>>> > >
>>> > > Also, I see the following warnings in my_asterixdb1.log; there are no
>>> > > warnings or errors in cc.log:
>>> > >
>>> > > "Oct 19, 2015 8:37:39 AM
>>> > > org.apache.hyracks.api.lifecycle.LifeCycleComponentManager configure
>>> > > SEVERE: LifecycleComponentManager configured
>>> > > org.apache.hyracks.api.lifecycle.LifeCycleComponentManager@7559ec47
>>> > > ..
>>> > > INFO: Completed sharp checkpoint.
>>> > > Oct 19, 2015 8:37:40 AM
>>> > > org.apache.asterix.om.util.AsterixClusterProperties getIODevices
>>> > > WARNING: Configuration parameters for nodeId my_asterix_node1 not
>>> > > found. The node has not joined yet or has left.
>>> > > Oct 19, 2015 8:37:40 AM
>>> > > org.apache.asterix.om.util.AsterixClusterProperties getIODevices
>>> > > WARNING: Configuration parameters for nodeId my_asterix_node1 not
>>> > > found. The node has not joined yet or has left.
>>> > > Oct 19, 2015 8:38:38 AM
>>> > > org.apache.hyracks.control.common.dataset.ResultStateSweeper sweep
>>> > > INFO: Result state cleanup instance successfully completed."
>>> > >
>>> > > It seems that AsterixDB is running, and I can access it at port 19001.
>>> > >
>>> > > The documentation shows ingestion of tweets, but I would be
>>> > > interested in using sockets. Is it possible to ingest data from
>>> > > sockets?
>>> > >
>>> > > Regards,
>>> > > -Pekka
>>> > >
>>> > > Hey there Pekka,
>>> > >
>>> > > Your intuition is correct; most of the newer feeds features are in
>>> > > the current master branch and not in the (very) old 0.8.6 release.
>>> > > If you'd like to experiment with them, you'll have to build from
>>> > > source. The details about that are here:
>>> > > https://asterixdb.incubator.apache.org/dev-setup.html#setting-up-an-asterix-development-environment-in-eclipse
>>> > > but they're probably a bit overkill for just trying to get the
>>> > > compiled binaries.
>>> > > For that, all you really need to do is:
>>> > >
>>> > > - Clone Hyracks from git
>>> > > - 'mvn clean install -DskipTests'
>>> > > - Clone AsterixDB
>>> > > - 'mvn clean package -DskipTests'
>>> > >
>>> > > Then the binaries will sit in asterix-installer/target.
>>> > >
>>> > > For an example, the documentation shows how to set up a feed that's
>>> > > ingesting Tweets:
>>> > > https://asterix-jenkins.ics.uci.edu/job/asterix-test-full/site/asterix-doc/feeds/tutorial.html
>>> > >
>>> > > Thanks,
>>> > > -Ian
>>> > >
>>> > > On Wed, Oct 7, 2015 at 9:48 PM, Pääkkönen Pekka
>>> > > <[email protected]> wrote:
>>> > >
>>> > >> Hi,
>>> > >>
>>> > >> I would like to experiment with a socket-based feed. Can you point
>>> > >> me to an example of how to utilize one?
>>> > >>
>>> > >> Do I need to install the 0.8.7-snapshot version of AsterixDB in
>>> > >> order to experiment with feeds?
>>> > >>
>>> > >> Regards,
>>> > >> -Pekka Pääkkönen
>>
>> --
>> Raman

--
-----------------
Best Regards
Jianfeng Jia
Ph.D. Candidate of Computer Science
University of California, Irvine
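On the client side, the socket-based feed discussed in this thread only needs a TCP connection that writes delimited records while it stays open. A minimal sketch in Python follows; the host, port, and the assumption that the adapter accepts newline-delimited ADM/JSON-style records are all hypothetical and depend on how the feed is actually configured:

```python
import socket


def send_records(host, port, records, delimiter=b"\n"):
    """Open a TCP connection to a (hypothetical) socket feed endpoint and
    push each serialized record, delimiter-terminated, without closing the
    connection between records. The newline delimiter and endpoint are
    assumptions about the adapter's configuration, not a documented API."""
    with socket.create_connection((host, port)) as sock:
        for rec in records:
            # e.g. rec = '{"id": 1, "msg": "hello"}' in ADM-like syntax
            sock.sendall(rec.encode("utf-8") + delimiter)

# Hypothetical usage against a locally configured feed endpoint:
# send_records("127.0.0.1", 10001, ['{"id": 1, "msg": "hello"}'])
```

Whether records become queryable while the connection is open then depends on the frame-flushing policy discussed earlier in the thread ("parser-policy", "batch-size", "batch-interval"), not on the client closing the socket.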
