To Stephen

(Continuing on-list but asking if it would now be appropriate for
Stephen and me to discuss off-list)

Hi Stephen, I have been queuing up a write-to-you for a while.

On Thursday, February 26, 2015 11:37:50 AM Stephen Allen wrote:
> On Wed, Feb 25, 2015 at 6:56 AM, Chris Dollin <[email protected]>
> wrote:
> > On 02/25/2015 11:30 AM, Andy Seaborne wrote:
> >> Final call for Jena 2.13.0.
> >
> > Stephen wrote:
> >>> I finished up and committed some outstanding changes I had for jena-text.
> >>> I added the ability to specify an analyzer for the query text itself that
> >>> was different than the one used for the document.  I also added some
> >>> documentation explaining it on the site.
> >>
> >> Is there a JIRA for these changes?  I have only a superficial
> >> understanding here but is any of this related to JENA-686?
> >>
> >> Stephen+Chris : maybe some discussion of plans and intentions on the dev@
> >> list?
> >
> > Sure. I have some notes about what the 686 changes are about I can
> > transcribe. I have been making the (originally small) changes for
> > 686 compatible with master and have (rightly or wrongly) been delaying
> > discussion until I had something that seemed to be sound.
> >
> > Right now I'm merging in the latest master changes and am expecting to
> > make a pull request this PM.

And yesterday I successfully integrated against the latest master.

> > I'm guessing that it's unlikely the changes will be reviewed in time
> > to make it into 2.13.0?

We (I) decided that the changes would not have time to be properly reviewed
for 2.13.0 but wish them to be integrated as soon as convenient.

> The query analyzer change is pretty separate from JENA-686, it just exposes
> a capability that Lucene already has.  This is useful for example if you
> are using the StandardAnalyzer to tokenize the stored document, but perhaps
> you want to use one that tokenizes the query string differently.  You
> already could do this with jena-text's Solr implementation, since the
> configuration for that is controlled via the Solr config file.

Yes, the analyzer change doesn't affect JENA-686 except that it's treading
in the same code.

> The conjunctive query idea of Chris' is also something I would look forward
> to.  It actually looks like I may have implemented a feature that Chris
> needed, the ability to specify a custom TextDocProducer.

Yes, our fork had that capability in it. Yesterday I integrated our fork with
your code and switched to the assembler vocabulary you were using.

>  Chris: I would be interested to see your approach for this.

(Credit where credit's due; the code for this was developed by my
colleagues at Epimorphics.)

> Are you planning on waiting until all statements have been inserted
> then querying the RDF store to regenerate the documents for subjects
> that have been changed?  How do you handle triple deletion?

[Answer to follow, so as not to delay this message any more ...]

> I implemented the custom TextDocProducer for a slightly different reason,
> which was to handle triple deletions and remove the document from the
> lucene index.  However, my triple deletion code is kind of a hack (I am
> only currently indexing rdfs:label, and my application enforces a
> cardinality of 1 for that property, so I can just delete all documents with
> a given subject and predicate).  The index does not actually keep the value
> of the document, it only indexes it, so this solution would not work in the
> general case.  I would propose in the future that we actually store and not
> just index the document so that it can be appropriately identified and
> deleted.  This would require a change to existing Lucene databases (we
> should provide a tool to reindex existing data).  An alternative to
> actually storing the value would be to generate a hash of the
> subject+predicate+object and store that as an identifier.

[Likewise]
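
For illustration only, the hash alternative mentioned above (an identifier
derived from subject+predicate+object) might be sketched roughly as follows.
This is not code from jena-text or from our fork: the class name TripleIdHash,
the method idFor, the choice of SHA-256 and the length-prefixed encoding are
all assumptions made for the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of jena-text): derives a stable identifier
// for one triple, which an index could store so that the matching document
// can be located again when the triple is deleted.
final class TripleIdHash {

    // Returns a 64-character lowercase hex SHA-256 digest of the three terms.
    static String idFor(String subject, String predicate, String object) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String term : new String[] { subject, predicate, object }) {
                byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
                // Length-prefix each term so that ("ab", "c") and ("a", "bc")
                // do not collapse into the same byte sequence.
                md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                md.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Example terms are made up for the demonstration.
        String id = idFor("http://example.org/thing",
                          "http://www.w3.org/2000/01/rdf-schema#label",
                          "a label");
        System.out.println(id);
    }
}
```

With something like this, the same triple always yields the same identifier,
so a deletion could look the document up by that one stored field without the
index having to store the literal itself. How the identifier would be written
into or matched against the Lucene index is not shown here.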

> Chris, I see in the JIRA that you talk about committing work to a branch,
> but I can't seem to locate it.  Is this in github somewhere?

Yes, we have a fork of apache-jena

    https://github.com/epimorphics/jena-config-doc-producer

and the branch is

    updated-text-indexing

Chris
-- 
Chris "allusive" Dollin
