Hi Daniel, 

Thank you for your feedback. I have a few questions and points I need clarified; 
my comments are inline: 

On 14 Jan 2014, at 16:12, Daniel Spicar <[email protected]> wrote:

> First a little warning: I do not understand the workings of CRIS and
> Lucene 100% anymore as it has been a while since I created the
> GenericCondition (Lucene query syntax) and the Index structure and it used
> to be a different version of lucene. So double check anything I say that
> sounds odd ;)
> 

Sounds excellent. I’ve made rdf.cris my new home :)

> On what level do you want to do this change? In general it is a possible
> approach to change the indexing to not index the raw URI but slugification
> loses information (in general). You don't want that as some applications
> may depend on the exact URIs (for example for ordering case sensitively).

Are there applications that order the values by key? From what I understand and 
have seen in the code, the field.name (which is the key) is never used. The 
only key used is the reference to the result’s URI, to “flourish” it later on in 
the process. As far as I can see, slugifying the value of the field name should 
not have any impact on the rest of ZZ. 

I’ve tested (on my fork) slugifying the field name by simply adding a 
“toSlug()” method to the vProperty class. So far everything works and all the 
JUnit tests pass. Could you point me toward a specific case/impl that 
requires the field name to be an actual URI? 
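
For illustration, a slug helper of this kind could look like the sketch below. 
The class name FieldNames and the exact replacement rules (lowercase, 
underscores) are assumptions made for this example, not the code in the fork. 
Whatever the rules are, the same function has to be applied both when indexing 
and when building the query. 

```java
import java.util.Locale;

// Hypothetical helper, not part of CRIS: maps a URI to a field name
// that contains only [a-z0-9_] and so needs no query-syntax escaping.
final class FieldNames {
    private FieldNames() {}

    static String toSlug(String uri) {
        String lower = uri.toLowerCase(Locale.ROOT);
        // Collapse every run of characters outside [a-z0-9] into one underscore.
        String slug = lower.replaceAll("[^a-z0-9]+", "_");
        // Drop underscores left over at either end.
        return slug.replaceAll("^_+|_+$", "");
    }
}
```

With these rules, http://www.w3.org/1999/02/22-rdf-syntax-ns#type becomes 
http_www_w3_org_1999_02_22_rdf_syntax_ns_type. The mapping is lossy: 
http://a.b/c#d and http://a.b/c/d give the same slug, which is the kind of 
information loss mentioned above. 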


> 
> But I am not sure messing with the indexing is necessarily the right approach
> as some applications may depend on the current index values and their
> quirks. But you can always add a slugified URI field and perform queries on
> that. When you implement a custom condition, you need to make sure that you
> search on the correct field. Hence when the field name is a slugified URI,
> then you need to slugify the input URI and use that as the field name in
> the TermQuery.

Excellent. Custom conditions are working like a charm. I took the small liberty 
of extending the abstractCondition class by adding a boost and a boolean clause. 
So when constructing a query through the API, we can set the boost and the 
boolean clause (MUST vs SHOULD) individually on each condition.
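
A rough, self-contained sketch of the idea follows. It does not use the CRIS or 
Lucene classes: ClauseSketch, Condition and Occur are hypothetical stand-ins 
that only render the resulting query string, to show how a per-condition boost 
and MUST/SHOULD setting end up in the query. In Lucene itself this would 
presumably go through BooleanQuery.add(query, BooleanClause.Occur.MUST) and 
Query.setBoost(float). 

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the real code would build Lucene Query objects.
final class ClauseSketch {
    enum Occur { MUST, SHOULD }

    static final class Condition {
        final String field;
        final String value;
        final float boost;
        final Occur occur;

        Condition(String field, String value, float boost, Occur occur) {
            this.field = field;
            this.value = value;
            this.boost = boost;
            this.occur = occur;
        }

        // Renders one clause in Lucene query syntax: "+" marks MUST,
        // "^boost" is appended only when the boost differs from 1.0.
        String render() {
            StringBuilder sb = new StringBuilder();
            if (occur == Occur.MUST) {
                sb.append('+');
            }
            sb.append(field).append(':').append(value);
            if (boost != 1.0f) {
                sb.append('^').append(boost);
            }
            return sb.toString();
        }
    }

    // Joins the rendered clauses with spaces, one clause per condition.
    static String render(List<Condition> conditions) {
        List<String> parts = new ArrayList<>();
        for (Condition c : conditions) {
            parts.add(c.render());
        }
        return String.join(" ", parts);
    }
}
```

A MUST condition on body:sodium with boost 1.0 plus a SHOULD condition on 
type:patent with boost 0.5 renders as "+body:sodium type:patent^0.5". 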

> 
> Another approach, without messing with the index directly, and when you
> want to use Lucene query syntax, is to just make sure to escape special
> characters with a backslash. This can be automated.
> 

Yes, this could be another solution, but I am also concerned that Lucene does 
not cope well with very long field names, and that escaping them in all the 
right places in the right manner is difficult (it can only be enforced by “good 
practices”). 
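
For reference, automating the escaping could be as small as the sketch below. 
LuceneEscaper is a hypothetical stand-alone helper written for this example; 
the character list is my understanding of what the classic query parser treats 
as syntax, and Lucene's own QueryParser.escape(String) should be preferred 
where it is available. 

```java
// Hypothetical helper: backslash-escapes characters that the classic
// Lucene query parser treats as syntax, so a raw URI can be used in a
// query string as a field name or value.
final class LuceneEscaper {
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    private LuceneEscaper() {}

    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            // Prefix each special character with a backslash.
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Applied to http://askagfdasd.jasd#toto:hello this gives 
http\:\/\/askagfdasd.jasd#toto\:hello. The helper is simple; the hard part 
remains making sure every caller actually uses it. 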

> 
> 2014/1/13 Stephane Gamard <[email protected]>
> 
>> Thanks Daniel,
>> 
>> Yes I saw that. Stupid mistake from me, I thought there were all stored. I
>> think I found the problem with my boost. Currently the condition is
>> expressed as a lucene query but fails the syntax when the key is a RDF
>> uriref: http://askagfdasd.jasd#toto:hello is not valid.
>> 
>> I’m thinking about slug-fying the name of the field instead of having the
>> raw uri used as the field key. What do you think?
>> 
>> _Stephane
>> 
>> On 13 Jan 2014, at 17:34, Daniel Spicar <[email protected]> wrote:
>> 
>>> Hi Stephane
>>> 
>>> This is a prefix added to Lucene stored fields (the fields that actually
>>> get stored "as is" or unmodified in the document and returned by Lucene
>>> when asking for Documents). Lucene also creates (or can be told to do so)
>>> fields which are not "stored", thus one can analyze, tokenize, etc the
>>> original value and create fields by which Lucene can search/sort - but
>>> those fields are not returned as part of the document.
>>> 
>>> We add them to all stored fields before indexing (in the
>>> GraphIndexer.resourceToDocument method). I am not sure anymore why
>> exactly
>>> this was needed. I think there was a peculiar problem with the sort order
>>> when this was missing but I am not sure what exactly needed this
>>> "workaround".
>>> 
>>> Daniel
>>> 
>>> 
>>> 
>>> 2014/1/13 Stephane Gamard <[email protected]>
>>> 
>>>> Hi all,
>>>> 
>>>> I am trying to implement  new conditions for CRIS and I’ve come around a
>>>> peculiar problem. I’ve create a “BoostCondition” based on the same
>>>> principle than the WildCardCondition. Here’s it’s Ctor and query method:
>>>> 
>>>> public BoostCondition(VirtualProperty property, String value, Float
>>>> boost) {
>>>>   this.property = property;
>>>>   this.value = value;
>>>>   this.boost = boost;
>>>> }
>>>> 
>>>> public BoostCondition(UriRef uriRefProperty, String value, Float
>> boost) {
>>>>   this(new PropertyHolder(uriRefProperty,false), value,boost);
>>>> }
>>>> 
>>>> @Override
>>>> protected Query query() {
>>>>   TermQuery termQuery = new TermQuery(new Term(property.getStringKey(),
>>>> value));
>>>>   termQuery.setBoost(boost);
>>>>   return termQuery;
>>>> }
>>>> 
>>>> Nothing fancy and here is how it is used:
>>>> 
>>>>      conditions.add(new BoostCondition(RDF.type, "<
>>>> http://www.patexpert.org/ontologies/pmo.owl#PatentPublication>", new
>>>> Float(0.5)));
>>>> 
>>>>       final List<NonLiteral> matchingNodes =
>>>> indexService.findResources(conditions, facetCollector);
>>>>       node.addPropertyValue(ECS.contentsCount, matchingNodes.size());
>>>> 
>>>> All is well EXCEPT that in CRIS it will look for the field ‘RDF.type’
>>>> while when indexed it is indexed as: “_STORED_”+RDF.type as per the
>>>> following lucene Query:
>>>> +J683e9b57eca321d4a268d4b24df62c9bfb7169b2:*sodium* +
>>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#type:<
>>>> http://www.patexpert.org/ontologies/pmo.owl#PatentPublication>^0.5
>>>> 
>>>> 
>>>> Attached is log with and without the custom condition
>>>> 
>>>> INDEXING
>>>> ==========
>>>> 
>>>> 13.01.2014 16:05:50.165 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer CRIS Reindex Thread[386]:
>> cache
>>>> full or writes have ceased. Indexing...
>>>> 13.01.2014 16:05:50.379 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing <
>>>> http://fusepool.info/doc/pmc/3470790> considering 3 properties
>>>> ([org.apache.clerezza.rdf.cris.PropertyHolder@b309d7c1,
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@f5e5585a,
>>>> org.apache.clerezza.rdf.cris.JoinVirtualProperty@cf3ff15e])
>>>> 13.01.2014 16:05:50.379 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@b309d7c1 with values 1
>>>> 13.01.2014 16:05:50.379 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@b309d7c1(
>>>> http://purl.org/dc/elements/1.1/subject) with value
>>>> http://fusepool.info/id/caa7fc7a-f024-47d8-925d-151eb8600b6b
>>>> 13.01.2014 16:05:50.379 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@f5e5585a with values 2
>>>> 13.01.2014 16:05:50.379 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@f5e5585a(
>>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#type) with value
>>>> http://fusepool.eu/ontologies/ecs#ContentItem
>>>> 13.01.2014 16:05:50.379 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@f5e5585a(
>>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#type) with value
>>>> http://purl.org/ontology/bibo/Document
>>>> 13.01.2014 16:05:50.380 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.JoinVirtualProperty@cf3ff15e with values 1
>>>> 13.01.2014 16:05:50.380 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.JoinVirtualProperty@cf3ff15e
>> (J683e9b57eca321d4a268d4b24df62c9bfb7169b2)
>>>> with value Two barriers for sodium in vascular endothelium? Vascular
>>>> endothelium plays a key role in blood pressure regulation. Recently, it
>> has
>>>> been shown that a 5% increase of plasma sodium concentration (sodium
>>>> excess) stiffens endothelial cells by about 25%, leading to cellular
>>>> dysfunction. Surface measurements demonstrated that the endothelial
>>>> glycocalyx (eGC), an anionic biopolymer, deteriorates when sodium is
>>>> elevated. In view of these results, a two-barrier model for sodium
>> exiting
>>>> the circulation across the endothelium is suggested. The first sodium
>>>> barrier is the eGC which selectively buffers sodium ions with its
>>>> negatively charged prote-oglycans.The second sodium barrier is the
>>>> endothelial plasma membrane which contains sodium channels. Sodium
>> excess,
>>>> in the presence of aldosterone, leads to eGC break-down and, in
>> parallel,
>>>> to an up-regulation of plasma membrane sodium channels. The following
>>>> hypothesis is postulated: Sodium excess increases vascular sodium
>>>> permeability. Under such con-ditions (e.g. high-sodium diet), day-by-day
>>>> ingested sodium, instead of being readily buffered by the eGC and then
>>>> rapidly excreted by the kidneys, is distributed in the whole body before
>>>> being finally excreted. Gradually, the sodium overload damages the
>> organism.
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing <
>>>> http://fusepool.info/doc/pmc/3581062> considering 3 properties
>>>> ([org.apache.clerezza.rdf.cris.PropertyHolder@b309d7c1,
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@f5e5585a,
>>>> org.apache.clerezza.rdf.cris.JoinVirtualProperty@cf3ff15e])
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@b309d7c1 with values 2
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@b309d7c1(
>>>> http://purl.org/dc/elements/1.1/subject) with value
>>>> http://fusepool.info/id/4cfa649e-5eca-4349-bbb5-f782b87089d4
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@b309d7c1(
>>>> http://purl.org/dc/elements/1.1/subject) with value
>>>> http://fusepool.info/id/f421cc4a-619c-4189-a3ef-3c2025e50ac9
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@f5e5585a with values 2
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@f5e5585a(
>>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#type) with value
>>>> http://fusepool.eu/ontologies/ecs#ContentItem
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.PropertyHolder@f5e5585a(
>>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#type) with value
>>>> http://purl.org/ontology/bibo/Document
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.JoinVirtualProperty@cf3ff15e with values 1
>>>> 13.01.2014 16:05:50.388 *INFO* [CRIS Reindex Thread[386]]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer indexing
>>>> org.apache.clerezza.rdf.cris.JoinVirtualProperty@cf3ff15e
>> (J683e9b57eca321d4a268d4b24df62c9bfb7169b2)
>>>> with value Diagnosis and treatment of mitochondrial myopathies
>>>> Mitochondrial disorders are a heterogeneous group of disorders resulting
>>>> from primary dysfunction of the respiratory chain. Muscle tissue is
>> highly
>>>> metabolically active, and therefore myopathy is a common element of the
>>>> clinical presentation of these disorders, although this may be
>> overshadowed
>>>> by central neurological features. This review is aimed at a general
>> medical
>>>> and neurologist readership and provides a clinical approach to the
>>>> recognition, investigation, and treatment of mitochondrial myopathies.
>>>> Emphasis is placed on practical management considerations while
>> including
>>>> some recent updates in the field.
>>>> 
>>>> 
>>>> 
>>>> SEARCH WITHOUT CUSTOM CONDITION
>>>> ===============================
>>>> 
>>>> 13.01.2014 16:07:48.343 *INFO* [627421185@qtp-612005121-38]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer luceneQuery:
>>>> +J683e9b57eca321d4a268d4b24df62c9bfb7169b2:*sodium*
>>>> 13.01.2014 16:07:48.361 *INFO* [627421185@qtp-612005121-38]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer _STORED_
>>>> http://purl.org/dc/elements/1.1/subject :
>>>> http://fusepool.info/id/caa7fc7a-f024-47d8-925d-151eb8600b6b
>>>> 13.01.2014 16:07:48.361 *INFO* [627421185@qtp-612005121-38]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer _STORED_
>>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#type :
>>>> http://fusepool.eu/ontologies/ecs#ContentItem
>>>> 13.01.2014 16:07:48.361 *INFO* [627421185@qtp-612005121-38]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer _STORED_
>>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#type :
>>>> http://purl.org/ontology/bibo/Document
>>>> 13.01.2014 16:07:48.361 *INFO* [627421185@qtp-612005121-38]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer
>>>> _STORED_J683e9b57eca321d4a268d4b24df62c9bfb7169b2 : Two barriers for
>> sodium
>>>> in vascular endothelium? Vascular endothelium plays a key role in blood
>>>> pressure regulation. Recently, it has been shown that a 5% increase of
>>>> plasma sodium concentration (sodium excess) stiffens endothelial cells
>> by
>>>> about 25%, leading to cellular dysfunction. Surface measurements
>>>> demonstrated that the endothelial glycocalyx (eGC), an anionic
>> biopolymer,
>>>> deteriorates when sodium is elevated. In view of these results, a
>>>> two-barrier model for sodium exiting the circulation across the
>> endothelium
>>>> is suggested. The first sodium barrier is the eGC which selectively
>> buffers
>>>> sodium ions with its negatively charged prote-oglycans.The second sodium
>>>> barrier is the endothelial plasma membrane which contains sodium
>> channels.
>>>> Sodium excess, in the presence of aldosterone, leads to eGC break-down
>> and,
>>>> in parallel, to an up-regulation of plasma membrane sodium channels. The
>>>> following hypothesis is postulated: Sodium excess increases vascular
>> sodium
>>>> permeability. Under such con-ditions (e.g. high-sodium diet), day-by-day
>>>> ingested sodium, instead of being readily buffered by the eGC and then
>>>> rapidly excreted by the kidneys, is distributed in the whole body before
>>>> being finally excreted. Gradually, the sodium overload damages the
>> organism.
>>>> 13.01.2014 16:07:48.361 *INFO* [627421185@qtp-612005121-38]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer resource-uri :
>>>> http://fusepool.info/doc/pmc/3470790
>>>> 13.01.2014 16:07:48.361 *INFO* [627421185@qtp-612005121-38]
>>>> org.apache.stanbol.entityhub.core.impl.SiteManagerImpl No Referenced
>> Site
>>>> registered for Entity
>>>> http://fusepool.info/id/caa7fc7a-f024-47d8-925d-151eb8600b6b
>>>> 13.01.2014 16:07:48.361 *INFO* [627421185@qtp-612005121-38]
>>>> org.apache.stanbol.entityhub.core.impl.SiteManagerImpl No Referenced
>> Site
>>>> registered for Entity http://purl.org/ontology/bibo/Document
>>>> 
>>>> 
>>>> SEARCH WITH CUSTOM CONDITION
>>>> ============================
>>>> 
>>>> 13.01.2014 16:14:32.746 *INFO* [806435093@qtp-612005121-40]
>>>> org.apache.clerezza.rdf.cris.GraphIndexer luceneQuery:
>>>> +J683e9b57eca321d4a268d4b24df62c9bfb7169b2:*sodium* +
>>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#type:<
>>>> http://www.patexpert.org/ontologies/pmo.owl#PatentPublication>^0.5
>>>> 
>>>> 
>> 
>> 
