To Stephen (Continuing on-list but asking if it would now be appropriate for Stephen and me to discuss off-list)
Hi Stephen,

I have been queuing up a write-to-you for a while.

On Thursday, February 26, 2015 11:37:50 AM Stephen Allen wrote:
> On Wed, Feb 25, 2015 at 6:56 AM, Chris Dollin <[email protected]>
> wrote:
> > On 02/25/2015 11:30 AM, Andy Seaborne wrote:
> >> Final call for Jena 2.13.0.
> >
> > Stephen wrote:
> > I finished up and committed some outstanding changes I had for jena-text.
> >
> >>> I added the ability to specify an analyzer for the query text itself
> >>> that was different than the one used for the document. I also added
> >>> some documentation explaining it on the site.
> >>
> >> Is there a JIRA for these changes? I have only a superficial
> >> understanding here but is any of this related to JENA-686?
> >>
> >> Stephen+Chris : maybe some discussion of plans and intentions on the
> >> dev@ list?
> >
> > Sure. I have some notes about what the 686 changes are about I can
> > transcribe. I have been making the (originally small) changes for
> > 686 compatible with master and have (rightly or wrongly) been delaying
> > discussion until I had something that seemed to be sound.
> >
> > Right now I'm merging in the latest master changes and am expecting to
> > make a pull request this PM.

And yesterday I successfully integrated against the latest master.

> > I'm guessing that it's unlikely the changes will be reviewed in time
> > to make it into 2.13.0?

We (I) decided that the changes would not have time to be properly reviewed
for 2.13.0 but wish them to be integrated as soon as convenient.

> The query analyzer change is pretty separate from JENA-686, it just exposes
> a capability that Lucene already has. This is useful for example if you
> are using the StandardAnalyzer to tokenize the stored document, but perhaps
> you want to use one that tokenizes the query string differently. You
> already could do this with jena-text's Solr implementation, since the
> configuration for that is controlled via the Solr config file.

Yes, the analyzer change doesn't affect JENA-686 except that it's treading
in the same code.

> The conjunctive query idea of Chris' is also something I would look forward
> to. It actually looks like I may have implemented a feature that Chris
> needed, the ability to specify a custom TextDocProducer.

Yes, our fork had that capability in it. Yesterday I integrated our fork
with your code and switched to the assembler vocabulary you were using.

> Chris: I would be interested to see your approach for this.

(Credit where credit's due; the code for this was developed by my
colleagues at Epimorphics.)

> Are you planning on waiting until all statements have been inserted
> then querying the RDF store to regenerate the documents for subjects
> that have been changed? How do you handle triple deletion?

[Answer to follow, so as not to delay this message any more ...]

> I implemented the custom TextDocProducer for a slightly different reason,
> which was to handle triple deletions and remove the document from the
> Lucene index. However, my triple deletion code is kind of a hack (I am
> only currently indexing rdfs:label, and my application enforces a
> cardinality of 1 for that property, so I can just delete all documents with
> a given subject and predicate). The index does not actually keep the value
> of the document, it only indexes it, so this solution would not work in the
> general case. I would propose in the future that we actually store and not
> just index the document so that it can be appropriately identified and
> deleted. This would require a change to existing Lucene databases (we
> should provide a tool to reindex existing data). An alternative to
> actually storing the value would be to generate a hash of the
> subject+predicate+object and store that as an identifier.

[Likewise]

> Chris, I see in the JIRA that you talk about committing work to a branch,
> but I can't seem to locate it. Is this in github somewhere?

Yes, we have a fork of apache-jena

    https://github.com/epimorphics/jena-config-doc-producer

and the branch is updated-text-indexing

Chris

-- 
Chris "allusive" Dollin
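
For readers following the thread, the hashing alternative Stephen mentions above (an identifier derived from subject+predicate+object) might look roughly like the sketch below. This is not code from jena-text or from the Epimorphics fork: the class name TripleKey, the method tripleId, the choice of SHA-256, and the length-prefixed encoding of each part are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of jena-text): derives a stable identifier
// for a triple from the string forms of its subject, predicate and object.
final class TripleKey {
    private TripleKey() {}

    static String tripleId(String subject, String predicate, String object) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String part : new String[] { subject, predicate, object }) {
                byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
                // Length-prefix each part so ("ab","c") and ("a","bc")
                // cannot produce the same input to the digest.
                md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                md.update(bytes);
            }
            // Render the digest as lowercase hex so it can be kept as a
            // plain string field alongside the indexed text.
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Under these assumptions, the identifier would be stored as an untokenized field when a triple is indexed, and deleting the triple would mean recomputing the same identifier and deleting the one document that carries it, without needing the literal value to be stored in the index.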
