[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-03 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-108237260
  
Sorry your having problems - that should have worked when you press 
commit.  Just send the modified markdown to me directly and I'll merge the 
files.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-03 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-108383450
  
I have created a gist instead : 
https://gist.github.com/amiara514/262d73ed35580b7dbdfe


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-03 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-108376304
  
Ok, which address ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-03 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-108387604
  
That works great - thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-02 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-108096782
  
@amiara514 -- I'm not seeing a message from you on dev@.  Did it get sent?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-02 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-108118228
  
I sent it from the edition page (via  Improve this Page) with anonymous 
access.
The default setting is a post to dev@jena.apache.org.

Another way to send the file ?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-02 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-107989314
  
@afs : done, I modified the text-query.mdtext and sent it to the dev list.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-01 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-107572459
  
 The direct reflection of  site/trunk  is http://jena.staging.apache.org/ 
and the button should work there as well. 

Humm, this link leads to the same wrong place...
Ex: the section Configuring an Analyzer should finish with the 
explanation about specifiyng a query analyzer (New in Jena 2.13.0 is the 
optional ability to specify...) as the online version, and it's not the case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-01 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-107605937
  
Hmm indeed.  I'll trying making a small change to see if it flushes the 
correct mdtext file.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-06-01 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-107519936
  
Thank for pointing that out. The direct reflection of `site/trunk` is 
http://jena.staging.apache.org/ and the button should work there as well.  The 
website gets published explicitly and we have probably not published it as 
Jena3 mentions get into the content.  Sorry for any confusion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-29 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-106875094
  
Yes - adding the version info is more nice than essential.

(Really, we ought to undo references to really old stuff.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-27 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-106037240
  
Hi Andy,
I'm watching on the documentation part about linguistic stuff.
On the current doc, there are some references like : 
Starting with version 1.0.1, jena-text... or New in Jena 2.13.0 is 
the

Should I introduce my new changes with a reference of the next 3.0.0 
version of Jena ?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/jena/pull/64


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-25 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-105229534
  
Excellent!

I played a bit with this code. I noticed one potential issue: if I do a 
language specific query with a text:query argument such as 'lang:en', but there 
is not langField set for the index, the query parameter will just be silently 
ignored. Would it be better to return some kind of query error instead?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-25 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-105236007
  
@afs: I reviewed your [EntityDefinition 
additions](https://github.com/apache/jena/commit/66a1eda822d8f551fac06d6b0a2672decdc2)
 and they look fine to me. But should they be `@Deprecated`?

Alex's code also removes some methods from TextDatasetFactory. Should 
compatibility methods (`@Deprecated` perhaps) be added back there as well? 
Namely these:

```java
public static TextIndex createLuceneIndex(Directory directory, 
EntityDefinition def, Analyzer queryAnalyzer)
public static Dataset createLucene(Dataset base, Directory directory, 
EntityDefinition def, Analyzer queryAnalyzer)
public static DatasetGraph createLucene(DatasetGraph base, Directory 
directory, EntityDefinition def, Analyzer queryAnalyzer)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-25 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-105237062
  
All seems to be ok with the merge, thanks.

I'll try to make the doc by the end of the week.
A question about this... it will be online only when jena3 will be 
available, right ?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-25 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-105237426
  
I played a bit with this code. I noticed one potential issue: if I do a 
language specific query with a text:query argument such as 'lang:en', but there 
is not langField set for the index, the query parameter will just be silently 
ignored. Would it be better to return some kind of query error instead?

The documentation may be sufficient. no ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-22 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-104784383
  
git apply didn't see those! No matter - the PR is otherwise fine.

When the `EntityDefinition` the migration constructor (which can be 
@Deprecated -- just trying to minimise the bump to jena3) is there, we can 
merge it for further comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-22 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-104762759
  
I see deletion in /64.patch :

Date: Tue, 19 May 2015 14:41:32 -0400
Subject: [PATCH 3/3] langField implementation to store lang tags of 
literals +
 refactoring growing methods of TextDatasetFactory
...
 delete mode 100644 
jena-text/src/main/java/org/apache/jena/query/text/LuceneUtil.java
 create mode 100644 
jena-text/src/main/java/org/apache/jena/query/text/TextIndexConfig.java
 create mode 100644 
jena-text/src/main/java/org/apache/jena/query/text/analyzer/Util.java
 delete mode 100644 
jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexLuceneMultilingualAssembler.java
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-22 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-104760829
  
The diff and patch are supposed to be the same end result, even if 
cumulative commits. The patch does not seem to have the later deletes.  A 
bug(let) in github perhaps?

I use them to look at a PR before pulling it - it shows what's changed 
better (for me, in Eclipse) without changing git.  For the real thing, I pull 
requests if taking all of it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-22 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-104755152
  
I'm working from the pull request (`.../64.patch`). Strange indeed  : the 
`.diff` is good, the `.patch` is bad.

The website is at https://svn.apache.org/repos/asf/jena/site/trunk/ and the 
source of query.html is from the markdown 
https://svn.apache.org/repos/asf/jena/site/trunk/content/documentation/query/text-query.mdtext.
 

Either go to the HTML page and click Improve this Page or send the 
updated mdtext to dev@ and I can incorporate into the site.  No svn directly 
involved :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-22 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-104704987
  
Generally looks good.

I tried applying it as a patch and `TextIndexLuceneMultilingualAssembler` 
uses non-existent `TextDatasetFactory.createLuceneIndexMultilingual`. Is this 
PR dependent on another?

There is document at 
[query/text-query](http://jena.apache.org/documentation/query/text-query.html). 
In what way should that be updated?

Classes `analyzer.Util` and `LuceneUtil` appear to be the same code.

Would it be possible to put back `EntityDefinition` for maximum 
compatibility with the original code?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-22 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-104720469
  
I tried applying it as a patch and  TextIndexLuceneMultilingualAssembler  
uses non-existent  TextDatasetFactory.createLuceneIndexMultilingual . Is this 
PR dependent on another?

Strange, `TextIndexLuceneMultilingualAssembler` has been deleted... 
`LuceneUtil` too. 
Which source code branch do you use ?

Would it be possible to put back  EntityDefinition  for maximum 
compatibility with the original code?

Ok, I'll do it soon.

There is document at query/text-query. In what way should that be updated?

Should I put paragraph explanations on this discussion or elsewhere ?
It will be a re-format of the [dev 
mail](http://mail-archives.apache.org/mod_mbox/jena-dev/201505.mbox/raw/%3CBLU181-W56D79F6BA6AC7103317BC6ECC20%40phx.gbl%3E/1/2)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-21 Thread rvesse
Github user rvesse commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-104211029
  
+1 LGTM

Pinging @Stephen-Allen and @ehedgehog who have previously developed a lot 
of this code for further review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-20 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-103778663
  
 with a minus before to exclude filled values

Ah, I see. Sorry for the confusion. I had different expectations and the 
expression is a bit hard to read. I think this is fine, though it could be a 
bit clearer. And I now see that you also test for 'lang:none'.

 Is there a required formatting on the message ?
 ps: I will also mention the refactoring to obtain their advices on it.

No, just explain what this is about. Copying and pasting the still-relevant 
explanations from your comments here should work.

I know these comments already get echoed to the dev list, but it's probably 
difficult to follow since we've been discussing this quite a lot already.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-19 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-103661706
  
1. Not exactly, let's suppose that langField is used with name language.
If lang arg == none in the Sparql query, then the query extension for 
Lucene will be :
```
String qs2 = -language:*
```
with a minus before to exclude filled values

2. Ok I will submit it to the dev list. 
Is there a required formatting on on the message ?
ps: I will also mention the refactoring to obtain their advices on it.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-19 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-103635153
  
Languages can now be stored in the index by providing a langField param to 
EntityDefinition (assembler and java code). So you can use a TexIndexLucene 
with your own analyzer and target language of localized literals.

3 cases on Sparql clauses : 
```?s text:query (rdfs:label 'word' 'lang:en' )``` will target english 
literals
```?s text:query (rdfs:label 'word' 'lang:none')``` will target unlocalized 
literals
```?s text:query (rdfs:label 'word')``` will ignore language
NOTE: in Sparql queries, ```lang``` is a predefined keyword and the Lucene 
query will be mapped with the right langField name.  

Moreover, TextIndexConfig class has been introduced and EntityDefinition 
refactored to simplify TextDatasetFactory as desired.

LuceneUtil has moved as Util class in analyzer package. It still used with 
the LocalizedAnalyzerAssembler


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-19 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-103646019
  
Excellent work! Looks almost ready for merging.

A couple of comments:

1.You say that ?s text:query (rdfs:label 'word' 'lang:none') will target 
unlocalized literals. But the code does this:

```java
String qs2 = !none.equals(langArg)?
field + : + langArg : - + field + :*;
```

In my understanding, this will do a wildcard search on the language field 
if the query argument is lang:none. So it wouldn't be possible to search for 
non-language-tagged literals. Unless I'm missing something, I'd replace the 
above expression with simply this line:

```java
 String qs2 = field + : + langArg;
```

2.You have now removed some of the old constructors from EntityDefinition 
and the create* methods from TextDatasetFactory (and had to change the example 
code and some unit tests accordingly). While this leads to cleaner code, it 
also breaks the API so old code has to be changed. This might actually be a 
good moment to do it, as Jena has just bumped the version number to 3 and other 
APIs are also being changed, but I think we need opinions on that. If it turns 
out to be too disruptive, then compatibility methods could be added back.

Could you write a message to the dev@ list and explain your contribution? I 
could then join in with comments in favor of merging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-15 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-102297487
  
Thanks! I think decoupling these would be a good thing for other purposes 
too. For example, I have some plans to propose storing (optionally) the full 
literal values in the Lucene index, so that they can be used in SPARQL queries, 
and having the language tags in the index would help a bit with that.

Still a couple more proposals:

1. Your new LuceneUtil class appears to only contain (static) methods 
related to the multilingual analyzer. I propose moving those methods to 
TextIndexLuceneMultilingual and removing the LuceneUtil class, if there is no 
other use for it.
2. The TextDatasetFactory methods for creating indexes are getting more and 
more convoluted as new parameters and variations are added. (I'm also guilty of 
this - when I added graphField support, I just made new versions of the methods 
with more parameters, but left in the old ones too for compatibility.) I think 
that it would make sense to introduce a new class TextIndexConfiguration which 
could be used to set only those parameters that are relevant to the use, 
something like this:

```java
TextIndexConfiguration conf = new TextIndexConfiguration(entDef);
conf.setAnalyzer(analyzer);
conf.setGraphField(graph);
conf.setLanguageField(lang);
Dataset dsLucene = TextDatasetFactory.createLucene(dsBase, textdir, conf);
```

Using a multilingual index could perhaps be a boolean flag (since it's not 
just a standard analyzer), like this:

```java
conf.setMultilingualAnalyzer(true);
```

This way, I think we could avoid blowing up the number of createLucene and 
createLuceneIndex methods further. There would only be one more variant of each 
method in TextDatasetFactory, taking a TextIndexConfiguration parameter. When 
new features are added (such as langField and multilingual indexing), these can 
be added to TextIndexConfiguration but there is no need for further 
createLucene-style methods. createLuceneIndexMultilingual could be dropped.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-15 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-102398852
  
Yes completely agree with that, otherwise TextDatasetFactory will be 
difficult to maintain.
And no more need of TextIndexLuceneMultilingualAssembler also.

TextIndexConfiguration should be about : analyzer, queryAnalyzer, 
multilingual,.. and graphField, langField concern EntityDefinition no ?
Moreover, to avoid the same growing constructors, EntityDefinition should 
be configurable with setters/getters.


Just a question about the usual worflow, who decides to integrate, or not, 
a proposed code ?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-15 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-102406631
  
 TextIndexConfiguration should be about : analyzer, queryAnalyzer, 
multilingual,.. and graphField, langField concern EntityDefinition no ?
 Moreover, to avoid the same growing constructors, EntityDefinition should 
be configurable with setters/getters.

Oh yes, sorry, I was a bit confused about TextIndexFactory methods vs. 
EntityDefinition. Indeed the field-specific configuration belongs in 
EntityDefinition and I agree it would make sense to add setters there instead 
of yet more constructors. 

As it turns out the TextIndexConfiguration class I suggested will only be 
concerned with analyzer configuration, maybe it could be called 
TextIndexAnalyzerConfiguration or something like that.

 Just a question about the usual worflow, who decides to integrate, or 
not, a proposed code ?

The final decision is made by the core maintainers, often Andy Seaborne 
(@afs). But I have tried to help along to make this ready (Andy actually asked 
me for help as he was busy with the Jena3 transition) and I think other people 
with an interest in jena-text could also give their opinion.

I think this is beginning to look like ready for merging - assuming the 
changes discussed above will happen. When those are done I think it would make 
sense to ask for final comments on the dev@jena.apache.org mailing list, since 
few people have commented here on GitHub but I know there are people like 
myself on the list who make use of jena-text. The only controversial issue I 
can think of right now is the parameter format for the text:query property 
function - the lang:xx parameter looks a bit hacky to me, and while I can't 
suggest anything else, somebody else might.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-15 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-102413569
  
 I prefer TextIndexConfiguration, it will be easier to add future conf 
parameters.

Ok.

 About the lang:xx, I think that extra params should be generalized in the 
same manner, limit:10, score:x,... Hence it would allow params to be 
optional and would remove the order and size constraints.

I see the point. I don't know if there's another way to provide optional 
named parameters to ARQ property functions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-15 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-102412893
  
As it turns out the TextIndexConfiguration class I suggested will only be 
concerned with analyzer configuration, maybe it could be called 
TextIndexAnalyzerConfiguration or something like that.

I prefer TextIndexConfiguration, it will be easier to add future conf 
parameters.

Thanks for the explanation.

About the lang:xx, I think that extra params should be generalized in the 
same manner, limit:10, score:x,... Hence it would allow params to be 
optional and would remove the order and size constraints.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-14 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-102144875
  
Hi, I see that you already changed your code - impressive work!

One more suggestion - I hope it won't come too late, since you've already 
moved code from TextIndexLucene to TextIndexLuceneMultilingual like I 
suggested...

I've been thinking about how I could make use of this in my application 
(Skosmos). I don't have a need for language-specific analyzers 
(LowerCaseKeywordAnalyzer works better for the application, in all languages), 
but it could be useful to be able to target searches based on language - this 
way I could avoid some false hits from the text index and thus get faster 
queries overall. So I wonder if it would be possible to separate these two 
aspects:

1. Store language tags of literals in the Lucene index and be able to 
restrict the query to a specific language with a query parameter
2. Use different analyzers for different languages

Right now your code does both, but it's not possible to do only 1. 
Obviously 2 depends on 1.

How about adding a new option langField, similar to graphField, that 
can be configured via the assembler (or as a constructor parameter, just like 
graphField). When set to the name of a field (the obvious choice would be 
lang), the language tags would get stored in the index, and it would be 
possible to target queries for a specific language. This would already be base 
functionality for TextIndexLucene (sorry - I know you just moved the code 
away!).

Then the MultilingualAnalyzer, implemented in TextIndexLuceneMultilingual, 
would depend on having langField set and would actually cause the analyzer to 
be dynamically selected based on language.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-14 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-102207391
  
Ok, it's not supposed to be a big job. I'll take a look soon.
For the multilingual analyzer, the lang must be stored anyway, it depends 
on it (like you said in point 2). So, either the langField param will be 
ignored or an exception will be raised to alert the forgotten field ?

Another point :
In the current version, I put an undef value rather than an empty one for 
the unlocalized literals. Because the query on empty field is not obvious with 
Lucene and I want to be able to search unlocalized values in explicit way. In 
your case, I don't think it will be a necessity.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-13 Thread amiara514
Github user amiara514 commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-101808418
  
Ok, I will separate it to have a distinct behavior. 
Removing the isMultilingual() method will probably force to switch some 
private fields and/or methods to protected.

 So you should never see a literal such as book@eng in well-formed RDF 
data.

Yes unfortunaltely. It was to accept this case. Ok, if the strict 
implementation about standards is sufficient, I will bypass the 3 to 2 letters 
conversion.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-13 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/64#issuecomment-101780700
  
This looks very promising. As I've said before, I think a single index is 
preferable to multiple indexes. It should require less book-keeping overall.

I have to say I don't like the current division of labor between 
TextIndexLucene and TextIndexLuceneMultilingual, with the latter having an 
isMultilingual() method that returns true, and all the actual multilingual 
support in TextIndexLucene.

I think that ideally, TextIndexLucene should not be in any way aware of the 
multilingual aspects and should merely provide hooks or extension/override 
points that can then be used by TextIndexLuceneMultilingual to provide 
multilingual functionality.

As an example, instead of this (in TextIndexLucene around line 230):

```java
Analyzer analyzer = null;
if (isMultilingual())
analyzer = 
LuceneUtil.getLocalizedAnalyzer(entity.getLanguage());

if (analyzer != null)
indexWriter.addDocument(doc, analyzer) ;
else //use the default one
indexWriter.addDocument(doc) ;
```

you could have something like this:

```java
/* at the end of TextIndexLucene.addEntity() */
  addDocument(entity);
  }

  /* extension point; can be overridden if necessary */
  protected void addDocument(Entity entity) {
   Document doc = doc(entity);
   indexWriter.addDocument(doc);
  }

/* in TextIndexLuceneMultilingual */

  @Override
  protected void addDocument(Entity entity) {
   Document doc = doc(entity);
   Analyzer analyzer = 
LuceneUtil.getLocalizedAnalyzer(entity.getLanguage());
   indexWriter.addDocument(doc, analyzer);
  }
```

I understand that compromises sometimes have to be made, but putting all 
the multilingual support code in the TextIndexLucene class is, I think, pushing 
it too far. It also makes it hard to extend TextIndexLucene for other purposes.

Another thing: I don't quite understand the need to handle 3 letter 
language tags. AFAIU the relevant standards (RDF 1.0, XML 1.0, RDF 1.1) all 
specify that language tags should be formatted according to BCP 47 / RFC 4646 
which, I think, states that when a 2 letter ISO 639-1 language tag exists, it 
should be used instead of the equivalent 3-letter ISO 639-2 or ISO 639-3 codes. 
So you should never see a literal such as book@eng in well-formed RDF data.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Jena-text multilingual alternative implementati...

2015-05-13 Thread amiara514
GitHub user amiara514 opened a pull request:

https://github.com/apache/jena/pull/64

Jena-text multilingual alternative implementation

This proposal is an alternative of 
[pull52](https://github.com/apache/jena/pull/52) (JENA-928).
It deals with a single index that includes language as a new field entry.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LICEF/jena jena-text-ml-single-index

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/jena/pull/64.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #64


commit 9553c6b2c246bc9c05906096c1f56d65ba15fed8
Author: Alexis Miara alexis_mi...@hotmail.com
Date:   2015-05-13T15:23:56Z

Implementation of jena-text multilingual with a single index




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---