Re: [Owlim-discussion] BigOWLIM issues

Ing. Peter Kostelník PhD.  Tue, 05 Jan 2010 08:35:22 -0800

hi there,

so .. in our team, we had a discussion about the currently implemented FTS
and here are some comments/ideas (all example queries are in SPARQL):


what it looks like right now:

PREFIX fts: <http://www.ontotext.com/>
PREFIX foo: <http://guess.what/>
SELECT ?resource WHERE {
  ?resource foo:searchedProperty ?var.
  <contains:some:terms> fts:exactMatch ?var.
}

what is not nice: the searched list of terms, <contains:some:terms>, is not
really a valid URI .. and I guess we should be cool and respect the
standards and semantics of the languages used :) .. another problem is that
it is quite hard (is it even possible?) to define logical relations between
the searched terms in the query, or to use wildcards or fuzzy queries (when
talking about lucene) ..

let's look at how LuceneSail deals with this (if interested, see
https://dev.nepomuk.semanticdesktop.org/wiki/LuceneSail, the idea is sexy,
but, unfortunately, the implementation erases the brain a bit :) ):

PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>
PREFIX foo: <http://guess.what/>
SELECT ?resource ?score ?snippet WHERE {
  ?resource search:matches ?match.
  ?match search:query "contains~ && (searched~ || terms~0.2)";
    search:property foo:searchedProperty;  # optional
    search:score ?score;                   # optional
    search:snippet ?snippet.               # optional
}

LuceneSail comes with virtual triples that allow direct definition of the
search-engine-specific query (e.g. a lucene query in its own syntax), see
"search:query" .. used this way, it makes it possible to use logical
operators (or lucene features such as fuzzy or wildcard queries, ..) and to
retrieve the score or the matched snippet of the retrieved value ..
the ?match variable is bound to the graph of all its direct literal
properties, the URI related to ?match represents the lucene document, and
the literal properties represent the document fields ..
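
to make the query string "contains~ && (searched~ || terms~0.2)" a bit more
concrete, here is a minimal Python sketch of what it asks for: each term is
matched fuzzily against the tokens of a literal, and the per-term results
are combined with AND/OR. This is only an illustration under assumptions:
the similarity is a toy measure (1 minus edit distance over the longer
length), not Lucene's scoring; all function names are made up; and 0.5 is
used as the minimum similarity when no number follows the "~", which is our
understanding of the default in the Lucene 2.9/3.0 query parser.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Toy similarity in [0, 1]; an assumption, not Lucene's formula."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy(term, tokens, min_sim=0.5):
    """True if any token is at least min_sim similar to term (like term~min_sim)."""
    return any(similarity(term, tok) >= min_sim for tok in tokens)


def matches(text):
    """Hand translation of: contains~ && (searched~ || terms~0.2)."""
    tokens = text.lower().split()
    return fuzzy("contains", tokens) and (
        fuzzy("searched", tokens) or fuzzy("terms", tokens, 0.2)
    )
```

note how permissive a low value such as 0.2 is in this toy measure: almost
any token that shares a couple of letters with "terms" passes, so in
practice the threshold per term matters as much as the boolean structure.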

this approach is [syntac/seman]tically correct and pretty, but there are, of
course, countless other ways to use the FTS in a simple way ..

another clear and straightforward possibility is to extend SPARQL with a
custom filter function (I guess it would also be possible in SeRQL) .. it
would nicely support everything FTS needs, with full use of the
underlying indexing/search engine, for example:

PREFIX fts: <http://www.ontotext.com/>
PREFIX foo: <http://guess.what/>
SELECT ?resource WHERE {
  ?resource foo:searchedProperty ?var.
  FILTER (ftsQuery("contains~ && (some~ || terms~0.2)", ?var))
}

or for example (or whatever):
FILTER (ftsQuery("contains~ && (some~ || terms~0.2)", ?var) && lang(?var) = "bg")

where the FILTER function can be fed directly with the search-engine-specific
query .. of course, to make sense, the matching results should be ordered by
score ..

maybe another idea is to enable a score threshold :) , e.g.:
  FILTER (ftsQuery("contains~ && (some~ || terms~0.2)", ?var) > 0.8)
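
a rough sketch of what such a thresholded, score-ordered evaluation could
look like on the engine side, written in Python only for illustration. The
names search and term_overlap are hypothetical, and the scoring function
(fraction of query terms found in the literal) is a stand-in for whatever
the underlying engine, e.g. lucene, would really compute:

```python
def term_overlap(query_terms):
    """Build a toy scorer: fraction of query terms present in a literal."""
    wanted = set(query_terms)

    def score(literal):
        found = wanted & set(literal.lower().split())
        return len(found) / len(wanted)

    return score


def search(bindings, score_fn, threshold=0.0):
    """Keep (resource, literal) bindings scoring above threshold, best first."""
    scored = [(res, lit, score_fn(lit)) for res, lit in bindings]
    hits = [row for row in scored if row[2] > threshold]
    return sorted(hits, key=lambda row: row[2], reverse=True)


# Hypothetical bindings of ?resource and ?var from the triple pattern.
candidates = [
    ("foo:a", "contains some terms"),
    ("foo:b", "contains nothing"),
    ("foo:c", "some terms here"),
]
scorer = term_overlap(["contains", "some", "terms"])
for resource, literal, score in search(candidates, scorer, threshold=0.5):
    print(resource, round(score, 2))
```

the point of the sketch is only that the threshold and the ordering both
need the score as a number, so a boolean-only filter function would not be
enough for this variant.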

of course, this is a deadly complicated issue and there are a lot of problems
in designing a solution generic enough to be independent of the underlying
indexing/search engine implementation .. it is hard to generalize the
definition of custom term analyzers, scoring functions, logical relations
between terms, (gener...@#^%$@#$?) FTS query syntax .. a lot of blood to
lose .. so, it seems to be just a trade-off .. and lucene looks like the
perfect solution ..

what do you think about it? :)

best wishes,
                         Peter K.


> Hey Peter,
>
> Find attached a small piece of documentation which attempts to explain how
> the current FTS implementation is configured and works.
> Any feedback is (as always) welcome!
>
>
> Cheers,
> Ivan
>
> On Tuesday 05 January 2010 16:25:01 Ing. Peter Kostelník PhD. wrote:
>> hi there ..
>>
>> well, the FTS is crucial for us (as it is crucial for any search engine) ..
>> for now, we can live a while with our current full text index/search
>> implementation, but we would be, for sure, happy to have some working,
>> stable and fast FTS customized directly for the BigOWLIM :)
>>
>> anyway, of course, we are interested also in the current FTS
>> implementation .. if nothing else, we would be pleased to help you with
>> the testing ..
>>
>> best wishes,
>>                               Peter K.
>>
>> > Hello Peter
>> >
>> > we have multiple ways to support FTS within OWLIM. One of them is based
>> > on a proprietary FTI implementation and another one is based on Lucene.
>> > We are not making much noise about them, because they are still not
>> > properly documented; we are also working to further speed up both the
>> > indexing and the queries. Still, Ivan can help you to try the current
>> > implementation, if this feature is so important for you
>> >
>> > Regards
>> > Naso
>> >
>> > ----------------------------------------------------------
>> > Atanas Kiryakov
>> > Executive Director of Ontotext AD, http://www.ontotext.com
>> > Sirma Group, http://www.sirma.bg
>> > Phone: (+359 2) 974 61 44; Fax: 975 3226
>> > ----------------------------------------------------------
>> > Fortes fortuna adiuvat.
>> > ----- Original Message -----
>> > From: "Ing. Peter Kostelník PhD." <[email protected]>
>> > To: <[email protected]>
>> > Sent: Tuesday, January 05, 2010 3:24 PM
>> > Subject: Re: [Owlim-discussion] BigOWLIM issues
>> >
>> >> hi there,
>> >>
>> >> to the lucene topic: now we are using the lucene sail, but we were
>> >> forced to hack it to be able to put it on top of bigowlim .. it is not
>> >> implemented too well, it has problems handling transactions that deal
>> >> with big amounts of data (when building the imports, it holds
>> >> everything in memory, so it is, logically, bound to run out of heap
>> >> space) .. we also found serious problems in the handling of the index
>> >> for multi-value properties and context data, but also in the
>> >> implementation of query evaluation .. so, if we wanted to avoid
>> >> building our own external full-text index, integration of lucene sail
>> >> required many hacks and workarounds directly in the lucene sail code ..
>> >>
>> >> and these are, in our opinion, the most important issues, which should
>> >> be taken into account .. together with performance optimizations, for
>> >> sure :)
>> >>
>> >> by the way, don't you plan to use the new release, lucene 3.0.0, which
>> >> has really many useful optimizations?
>> >>
>> >> cheers,
>> >>                             Peter K.
>> >>
>> >>> Hi Peter,
>> >>>
>> >>> Both issues are currently in progress. Lucene integration is currently
>> >>> only experimental and is not really flexible or stable to use in
>> >>> production (it is not even documented). Our goal is to provide enough
>> >>> flexibility so that e.g. custom analyzers and result rankings are
>> >>> easily pluggable by the engine user.
>> >>>
>> >>> It will be of great value to us if you could summarize the full-text
>> >>> search flexibility you will need.
>> >>>
>> >>> The "when" question is a lot harder to answer. I can't give you any
>> >>> concrete due dates currently, but this is something on the table now
>> >>> and we should be able to deliver results within the next couple of
>> >>> months. I hope I'm not too wrong about that... :)
>> >>>
>> >>>
>> >>> Cheers and have a happy new year!
>> >>> Ivan
>> >>>
>> >>> On Friday 18 December 2009 13:01:51 Ing. Peter Kostelník PhD. wrote:
>> >>>> hi there,
>> >>>>
>> >>>> we're planning to use the BigOWLIM as the production backend, so I've
>> >>>> got just a few questions regarding the further BigOWLIM development ..
>> >>>>
>> >>>> 1. I've noticed that in the 3.2.6 snapshot there is a direct dependency
>> >>>> on Lucene 2.9 core .. so, I assume, you're planning to integrate lucene
>> >>>> as the fulltext index/search engine .. pls, would there be support for
>> >>>> configuring lucene? I mean the essential issues, such as adding custom
>> >>>> analysers/tokenizers, fuzzy search support, custom query parsers, etc.?
>> >>>> .. when do you plan to integrate lucene?
>> >>>>
>> >>>> 2. is there some possibility to force BigOWLIM to perform logging in
>> >>>> some reasonable way? .. now everything is flushed into (I guess)
>> >>>> System.out/err .. and, well, this is not so suitable for a production
>> >>>> backend ..
>> >>>>
>> >>>> thanks in advance,
>> >>>> best wishes and merry christmas,
>> >>>>                                   Peter K.
>> >>>>
>> >>>> _______________________________________________
>> >>>> OWLIM-discussion mailing list
>> >>>> [email protected]
>> >>>> http://ontotext.com/mailman/listinfo/owlim-discussion

