On Wed, Feb 25, 2015 at 6:56 AM, Chris Dollin <[email protected]>
wrote:

> On 02/25/2015 11:30 AM, Andy Seaborne wrote:
>
>> Final call for Jena 2.13.0.
>>>>
>>>
> Stephen wrote:
>
>>> I finished up and committed some outstanding changes I had for jena-text.
>>> I added the ability to specify an analyzer for the query text itself that
>>> was different than the one used for the document.  I also added some
>>> documentation explaining it on the site.
>>>
>>
>> Is there a JIRA for these changes?  I have only a superficial
>> understanding here, but is any of this related to JENA-686?
>>
>> Stephen+Chris : maybe some discussion of plans and intentions on the dev@
>> list?
>>
>
> Sure. I have some notes about what the 686 changes are about I can
> transcribe. I have been making the (originally small) changes for
> 686 compatible with master and have (rightly or wrongly) been delaying
> discussion until I had something that seemed to be sound.
>
> Right now I'm merging in the latest master changes and am expecting to
> make a pull request this PM.
>
> I'm guessing that it's unlikely the changes will be reviewed in time
> to make it into 2.13.0?
>
>
The query analyzer change is pretty separate from JENA-686; it just exposes
a capability that Lucene already has.  This is useful if, for example, you
use the StandardAnalyzer to tokenize the stored document but want a
different analyzer to tokenize the query string.  You could already do this
with jena-text's Solr implementation, since the configuration for that is
controlled via the Solr config file.
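
To illustrate why the two analyzers might differ, here is a toy sketch in
plain Java.  It is not the jena-text or Lucene API: the class
QueryAnalyzerSketch and the two "analyzers" (tokenizing, keyword) are
hypothetical stand-ins, and the inverted index is just a map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical toy index; the names here are illustrative only.
final class QueryAnalyzerSketch {
    // Stand-in for a tokenizing analyzer: lower-case, split on non-alphanumerics.
    static List<String> tokenizing(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stand-in for a keyword-style analyzer: the whole input is one lower-cased token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text.toLowerCase(Locale.ROOT));
    }

    private final Map<String, Set<String>> postings = new HashMap<>();
    private final Function<String, List<String>> indexAnalyzer;
    private final Function<String, List<String>> queryAnalyzer;

    QueryAnalyzerSketch(Function<String, List<String>> indexAnalyzer,
                        Function<String, List<String>> queryAnalyzer) {
        this.indexAnalyzer = indexAnalyzer;
        this.queryAnalyzer = queryAnalyzer;
    }

    // Index a document using the index-time analyzer.
    void add(String docId, String text) {
        for (String token : indexAnalyzer.apply(text)) {
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }
    }

    // Return ids of documents that contain every token the query analyzer produces.
    Set<String> search(String queryText) {
        Set<String> result = null;
        for (String token : queryAnalyzer.apply(queryText)) {
            Set<String> docs = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        QueryAnalyzerSketch index = new QueryAnalyzerSketch(
                QueryAnalyzerSketch::tokenizing, QueryAnalyzerSketch::keyword);
        index.add("d1", "Apache Jena");
        System.out.println(index.search("Jena"));
        System.out.println(index.search("apache-jena"));
    }
}
```

With the tokenizing analyzer on both sides, the query "apache-jena" is split
into two tokens and matches the document; with the keyword-style analyzer on
the query side it stays a single token and matches nothing.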

I am also looking forward to Chris' conjunctive query idea.  It actually
looks like I may have implemented a feature that Chris needed: the ability
to specify a custom TextDocProducer.  Chris: I would be interested to see
your approach for this.  Are you planning on waiting until all statements
have been inserted, and then querying the RDF store to regenerate the
documents for subjects that have been changed?  How do you handle triple
deletion?

I implemented the custom TextDocProducer for a slightly different reason,
which was to handle triple deletions and remove the document from the
Lucene index.  However, my triple deletion code is kind of a hack (I am
currently only indexing rdfs:label, and my application enforces a
cardinality of 1 for that property, so I can just delete all documents with
a given subject and predicate).  The index does not actually keep the value
of the document; it only indexes it, so this solution would not work in the
general case.  I would propose that in the future we actually store, and not
just index, the document so that it can be appropriately identified and
deleted.  This would require a change to existing Lucene databases (we
should provide a tool to reindex existing data).  An alternative to
actually storing the value would be to generate a hash of the
subject+predicate+object and store that as an identifier.
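
A minimal sketch of the hash idea, in plain Java.  The class and method names
(TripleDocId, idFor) are hypothetical and not part of jena-text; the choice
of SHA-256, the length-prefixing, and passing each term as a string (for
example its N-Triples form) are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not an existing jena-text class.
final class TripleDocId {
    // Derive a stable identifier from the three terms of a triple.
    static String idFor(String subject, String predicate, String object) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String term : new String[] { subject, predicate, object }) {
                byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
                // Length prefix so that ("ab", "c") and ("a", "bc") hash differently.
                md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                md.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idFor(
                "http://example.org/s",
                "http://www.w3.org/2000/01/rdf-schema#label",
                "\"hello\""));
    }
}
```

Stored as an untokenized field on each document, an identifier like this
would let a deletion recompute the hash from the removed triple and delete
exactly that document, without keeping the literal value in the index.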

Chris, I see in the JIRA that you talk about committing work to a branch,
but I can't seem to locate it.  Is it on GitHub somewhere?

-Stephen
