Hi Folks,

We had an offline chat about this.

Since indexing all arbitrary fields is not feasible with the current
architecture, the requirement of indexing arbitrary fields in the log
analyzer will be handled in the Log Analyzer REST API. The idea is to
compare each incoming event against the existing schema, which is kept
in memory, and to update the table schema if there is a change.
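
A minimal sketch of that comparison step (the function and type names here are illustrative, not the actual LAS REST API):

```python
# Illustrative sketch of the schema-reconciliation step: compare an
# incoming event's fields with the cached in-memory table schema and
# report whether the table schema needs to be updated. All names are
# hypothetical, not part of the actual LAS REST API.

def reconcile_schema(cached_schema, event_fields):
    """Return (updated_schema, changed). Both arguments map field
    names to type names, e.g. {"level": "string"}."""
    new_fields = {name: ftype
                  for name, ftype in event_fields.items()
                  if name not in cached_schema}
    if not new_fields:
        return cached_schema, False          # fast path: no schema change
    updated = dict(cached_schema)
    updated.update(new_fields)               # extend, never drop columns
    return updated, True
```

Only when `changed` is true would the REST API push the updated schema to the table, so the common case stays a cheap in-memory lookup.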

Overriding the table schema would make the event sink configuration
inconsistent with the table schema. To avoid that, the event sink
feature needs to be improved to support merging table schemas, and the
event persist feature should have a flag to enable/disable schema
merging.
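
A sketch of what that flag could look like in the event persist path (again, the names are hypothetical, and schemas are assumed to be simple field-name-to-type maps):

```python
def apply_sink_schema(existing_schema, sink_schema, merge_schemas=True):
    """Hypothetical event-persist behavior behind the proposed flag.
    With merging disabled, the sink configuration simply overrides the
    table schema (the current, inconsistency-prone behavior); with
    merging enabled, the two schemas are unioned and existing column
    definitions win on conflict, so fields added dynamically by the
    REST API survive a sink redeployment."""
    if not merge_schemas:
        return dict(sink_schema)             # override: may drop columns
    merged = dict(sink_schema)
    merged.update(existing_schema)           # keep existing definitions
    return merged
```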

Thanks,

On Wed, Dec 2, 2015 at 1:30 PM, Sinthuja Ragendran <[email protected]>
wrote:

> Hi,
>
> On Wed, Dec 2, 2015 at 11:05 AM, Anjana Fernando <[email protected]> wrote:
>
>> On Wed, Dec 2, 2015 at 10:17 AM, Sachith Withana <[email protected]>
>> wrote:
>>
>>> Now that we are using logstash out of the box, without the DASConnector,
>>> it won't do that.
>>>
>>> Logstash would just start publishing, and with the current design,
>>> AFAIK the schema setting would be handled by the LAS server.
>>>
>>
>> Oh yeah, I see ..
>>
>>
>>>
>>> BTW for that requirement, can we provide a way to allow indexing all the
>>> columns?
>>>
>>
>> Well .. we can .. I guess this is the same thing Malith requested in the
>> first mail. The only catch is that we would have to change how we do
>> indexing internally. The current logic checks each input value against
>> the table schema and does the required indexing accordingly (e.g.
>> whether facets are defined, the data types, etc.), so "index all
>> fields" would be a new path there, and we would also have to introduce
>> a special per-table flag to say "index all". We would also need some
>> mechanism for figuring out the fields of a specific log type in the
>> server; at least with the table schema, we knew all the fields that
>> exist for all the log types. Ideally, we should store metadata
>> somewhere saying, for this specific log type, these are the fields, and
>> so on. Do we get some kind of log category/type information with the
>> standard logstash HTTP connector? .. Any other schema setting and
>> storing of metadata can be done on the server side, and we can cache it
>> in memory for fast lookups and modifications of the schema (together
>> with some cluster messaging to keep it in sync with other nodes).
>>
>> Or else, maybe we are again back to writing our own logstash adapter,
>> which would make the whole thing much simpler? ..
>>
>
> Yeah, +1. Actually, I was also thinking that having our own logstash
> adaptor would be a cleaner way to do this without complicating things
> much. :) Simply put, if we are able to specify which fields need to be
> indexed on the client side, and then make a call to the LAS REST service
> before publishing data, we can set the schema accordingly and things
> will work without any big effort.
>
> Thanks,
> Sinthuja.
>
>
>> Cheers,
>> Anjana.
>>
>>
>>>
>>> On Wed, Dec 2, 2015 at 10:11 AM, Anjana Fernando <[email protected]>
>>> wrote:
>>>
>>>> Hi Sachith,
>>>>
>>>> Doesn't the agent have knowledge of the log types/categories and
>>>> their field information when it is initializing? .. As I understood it,
>>>> we specify which fields need to be sent out in the configuration, isn't
>>>> that the case? ..
>>>>
>>>> Cheers,
>>>> Anjana.
>>>>
>>>> On Wed, Dec 2, 2015 at 10:01 AM, Sachith Withana <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> There might be a slight issue. We wouldn't know the arbitrary fields
>>>>> before the log agent starts publishing, since the agent only publishes
>>>>> and we don't have control over which fields will be sent (unless we
>>>>> configure all the agents ourselves). So we would have to check each
>>>>> event for new fields apart from those in the schema, which is
>>>>> undesirable.
>>>>>
>>>>> And, as Anjana pointed out, we don't have a way to index all the
>>>>> arbitrary values unless we set the schema accordingly.
>>>>>
>>>>> Is it possible to specify in the schema to index everything?
>>>>>
>>>>> On Wed, Dec 2, 2015 at 9:38 AM, Anjana Fernando <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Malith,
>>>>>>
>>>>>> The functionality you're requesting is very specific, and from the
>>>>>> DAS side it doesn't make sense to implement something this unusual in
>>>>>> a generic way. It is also not how the log analyzer should use it: the
>>>>>> different log sources will know their fields before they send out
>>>>>> data, so this doesn't have to be checked every time an event is
>>>>>> published. A log source would instruct the log analyzer backend API
>>>>>> about the new fields that this specific log source will be sending;
>>>>>> with that earlier message, the backend service will set the global
>>>>>> table's schema properly, and then the remote log agent will send out
>>>>>> log records to be processed by the server.
>>>>>>
>>>>>> Cheers,
>>>>>> Anjana.
>>>>>>
>>>>>> On Tue, Dec 1, 2015 at 6:44 PM, Malith Dhanushka <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Anjana,
>>>>>>>
>>>>>>> Yes, the requirement is for the internal log-related REST API which
>>>>>>> is being written using OSGi services. From the perspective of log
>>>>>>> analysis data, we have one master table to persist all the log
>>>>>>> events from different log sources. Log data comes into the log REST
>>>>>>> API as arbitrary fields, and different log sources have different
>>>>>>> sets of arbitrary fields, which forces the log REST API to change
>>>>>>> the schema of the master table every time it receives log events
>>>>>>> from a new/updated log source. That is what I meant by "inaccurate",
>>>>>>> and it can be solved much more cleanly by having that flag to index
>>>>>>> or not index arbitrary fields for a particular stream.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Malith
>>>>>>>
>>>>>>> On Tue, Dec 1, 2015 at 6:06 PM, Anjana Fernando <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Malith,
>>>>>>>>
>>>>>>>> No, it cannot be done like that. The way indexing works is that it
>>>>>>>> looks up the schema for a table and does the indexing according to
>>>>>>>> that, so the table schema must be set beforehand. It is not
>>>>>>>> something that can be set dynamically when arbitrary fields are
>>>>>>>> sent to the receiver, and it cannot load the current schema and set
>>>>>>>> it again for every event; we could cache that information and do
>>>>>>>> some operations, but that gets complicated. So the idea is that it
>>>>>>>> is the client's responsibility to set the target table's schema
>>>>>>>> properly beforehand, which may or may not include arbitrary fields,
>>>>>>>> and then send the data.
>>>>>>>>
>>>>>>>> Also, if this requirement is for the log analytics solution work,
>>>>>>>> as we've discussed before, there should be a whole new remote API
>>>>>>>> for that, and that API can do these operations inside the server
>>>>>>>> using the OSGi services rather than the original DAS REST API.
>>>>>>>> Those operations will then happen automatically while keeping the
>>>>>>>> remote log-related API clean.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Anjana.
>>>>>>>>
>>>>>>>> On Tue, Dec 1, 2015 at 5:13 PM, Malith Dhanushka <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Folks,
>>>>>>>>>
>>>>>>>>> Currently, indexing arbitrary fields is achieved by dynamically
>>>>>>>>> updating the analytics table schema through the analytics REST
>>>>>>>>> API. This is not an accurate solution for a frequently changing
>>>>>>>>> schema, so the ideal solution would be to have a flag in the data
>>>>>>>>> bridge event sink configuration to enable/disable indexing for
>>>>>>>>> all arbitrary fields.
>>>>>>>>>
>>>>>>>>> WDUT?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Malith
>>>>>>>>> --
>>>>>>>>> Malith Dhanushka
>>>>>>>>> Senior Software Engineer - Data Technologies
>>>>>>>>> *WSO2, Inc. : wso2.com <http://wso2.com/>*
>>>>>>>>> *Mobile*          : +94 716 506 693
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Anjana Fernando*
>>>>>>>> Senior Technical Lead
>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>> lean . enterprise . middleware
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Malith Dhanushka
>>>>>>> Senior Software Engineer - Data Technologies
>>>>>>> *WSO2, Inc. : wso2.com <http://wso2.com/>*
>>>>>>> *Mobile*          : +94 716 506 693
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Anjana Fernando*
>>>>>> Senior Technical Lead
>>>>>> WSO2 Inc. | http://wso2.com
>>>>>> lean . enterprise . middleware
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sachith Withana
>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>> E-mail: sachith AT wso2.com
>>>>> M: +94715518127
>>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Anjana Fernando*
>>>> Senior Technical Lead
>>>> WSO2 Inc. | http://wso2.com
>>>> lean . enterprise . middleware
>>>>
>>>
>>>
>>>
>>> --
>>> Sachith Withana
>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>> E-mail: sachith AT wso2.com
>>> M: +94715518127
>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>
>>
>>
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>>
>
>
>
> --
> *Sinthuja Rajendran*
> Associate Technical Lead
> WSO2, Inc.:http://wso2.com
>
> Blog: http://sinthu-rajan.blogspot.com/
> Mobile: +94774273955
>
>
>


-- 
Malith Dhanushka
Senior Software Engineer - Data Technologies
*WSO2, Inc. : wso2.com <http://wso2.com/>*
*Mobile*          : +94 716 506 693
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
