[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-108237260 Sorry your having problems - that should have worked when you press commit. Just send the modified markdown to me directly and I'll merge the files. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-108383450 I have created a gist instead : https://gist.github.com/amiara514/262d73ed35580b7dbdfe --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-108376304 Ok, which address ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-108387604 That works great - thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-108096782 @amiara514 -- I'm not seeing a message from you on dev@. Did it get sent? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-108118228 I sent it from the edition page (via Improve this Page) with anonymous access. The default setting is a post to dev@jena.apache.org. Another way to send the file ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-107989314 @afs : done, I modified the text-query.mdtext and sent it to the dev list. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-107572459 The direct reflection of site/trunk is http://jena.staging.apache.org/ and the button should work there as well. Humm, this link leads to the same wrong place... Ex: the section Configuring an Analyzer should finish with the explanation about specifiyng a query analyzer (New in Jena 2.13.0 is the optional ability to specify...) as the online version, and it's not the case. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-107605937 Hmm indeed. I'll trying making a small change to see if it flushes the correct mdtext file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-107519936 Thank for pointing that out. The direct reflection of `site/trunk` is http://jena.staging.apache.org/ and the button should work there as well. The website gets published explicitly and we have probably not published it as Jena3 mentions get into the content. Sorry for any confusion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-106875094 Yes - adding the version info is more nice than essential. (Really, we ought to undo references to really old stuff.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-106037240 Hi Andy, I'm watching on the documentation part about linguistic stuff. On the current doc, there are some references like : Starting with version 1.0.1, jena-text... or New in Jena 2.13.0 is the Should I introduce my new changes with a reference of the next 3.0.0 version of Jena ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user asfgit closed the pull request at: https://github.com/apache/jena/pull/64 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-105229534 Excellent! I played a bit with this code. I noticed one potential issue: if I do a language specific query with a text:query argument such as 'lang:en', but there is not langField set for the index, the query parameter will just be silently ignored. Would it be better to return some kind of query error instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-105236007 @afs: I reviewed your [EntityDefinition additions](https://github.com/apache/jena/commit/66a1eda822d8f551fac06d6b0a2672decdc2) and they look fine to me. But should they be `@Deprecated`? Alex's code also removes some methods from TextDatasetFactory. Should compatibility methods (`@Deprecated` perhaps) be added back there as well? Namely these: ```java public static TextIndex createLuceneIndex(Directory directory, EntityDefinition def, Analyzer queryAnalyzer) public static Dataset createLucene(Dataset base, Directory directory, EntityDefinition def, Analyzer queryAnalyzer) public static DatasetGraph createLucene(DatasetGraph base, Directory directory, EntityDefinition def, Analyzer queryAnalyzer) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-105237062 All seems to be ok with the merge, thanks. I'll try to make the doc by the end of the week. A question about this... it will be online only when jena3 will be available, right ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-105237426 I played a bit with this code. I noticed one potential issue: if I do a language specific query with a text:query argument such as 'lang:en', but there is not langField set for the index, the query parameter will just be silently ignored. Would it be better to return some kind of query error instead? The documentation may be sufficient. no ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-104784383 git apply didn't see those! No matter - the PR is otherwise fine. When the `EntityDefinition` the migration constructor (which can be @Deprecated -- just trying to minimise the bump to jena3) is there, we can merge it for further comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-104762759 I see deletion in /64.patch : Date: Tue, 19 May 2015 14:41:32 -0400 Subject: [PATCH 3/3] langField implementation to store lang tags of literals + refactoring growing methods of TextDatasetFactory ... delete mode 100644 jena-text/src/main/java/org/apache/jena/query/text/LuceneUtil.java create mode 100644 jena-text/src/main/java/org/apache/jena/query/text/TextIndexConfig.java create mode 100644 jena-text/src/main/java/org/apache/jena/query/text/analyzer/Util.java delete mode 100644 jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexLuceneMultilingualAssembler.java --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-104760829 The diff and patch are supposed to be the same end result, even if cumulative commits. The patch does not seem to have the later deletes. A bug(let) in github perhaps? I use them to look at a PR before pulling it - it shows what's changed better (for me, in Eclipse) without changing git. For the real thing, I pull requests if taking all of it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-104755152 I'm working from the pull request (`.../64.patch`). Strange indeed : the `.diff` is good, the `.patch` is bad. The website is at https://svn.apache.org/repos/asf/jena/site/trunk/ and the source of query.html is from the markdown https://svn.apache.org/repos/asf/jena/site/trunk/content/documentation/query/text-query.mdtext. Either go to the HTML page and click Improve this Page or send the updated mdtext to dev@ and I can incorporate into the site. No svn directly involved :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user afs commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-104704987 Generally looks good. I tried applying it as a patch and `TextIndexLuceneMultilingualAssembler` uses non-existent `TextDatasetFactory.createLuceneIndexMultilingual`. Is this PR dependent on another? There is document at [query/text-query](http://jena.apache.org/documentation/query/text-query.html). In what way should that be updated? Classes `analyzer.Util` and `LuceneUtil` appear to be the same code. Would it be possible to put back `EntityDefinition` for maximum compatibility with the original code? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-104720469 I tried applying it as a patch and TextIndexLuceneMultilingualAssembler uses non-existent TextDatasetFactory.createLuceneIndexMultilingual . Is this PR dependent on another? Strange, `TextIndexLuceneMultilingualAssembler` has been deleted... `LuceneUtil` too. Which source code branch do you use ? Would it be possible to put back EntityDefinition for maximum compatibility with the original code? Ok, I'll do it soon. There is document at query/text-query. In what way should that be updated? Should I put paragraph explanations on this discussion or elsewhere ? It will be a re-format of the [dev mail](http://mail-archives.apache.org/mod_mbox/jena-dev/201505.mbox/raw/%3CBLU181-W56D79F6BA6AC7103317BC6ECC20%40phx.gbl%3E/1/2) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user rvesse commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-104211029 +1 LGTM Pinging @Stephen-Allen and @ehedgehog who have previously developed a lot of this code for further review --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-103778663 with a minus before to exclude filled values Ah, I see. Sorry for the confusion. I had different expectations and the expression is a bit hard to read. I think this is fine, though it could be a bit clearer. And I now see that you also test for 'lang:none'. Is there a required formatting on the message ? ps: I will also mention the refactoring to obtain their advices on it. No, just explain what this is about. Copying and pasting the still-relevant explanations from your comments here should work. I know these comments already get echoed to the dev list, but it's probably difficult to follow since we've been discussing this quite a lot already. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-103661706 1. Not exactly, let's suppose that langField is used with name language. If lang arg == none in the Sparql query, then the query extension for Lucene will be : ``` String qs2 = -language:* ``` with a minus before to exclude filled values 2. Ok I will submit it to the dev list. Is there a required formatting on on the message ? ps: I will also mention the refactoring to obtain their advices on it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-103635153 Languages can now be stored in the index by providing a langField param to EntityDefinition (assembler and java code). So you can use a TexIndexLucene with your own analyzer and target language of localized literals. 3 cases on Sparql clauses : ```?s text:query (rdfs:label 'word' 'lang:en' )``` will target english literals ```?s text:query (rdfs:label 'word' 'lang:none')``` will target unlocalized literals ```?s text:query (rdfs:label 'word')``` will ignore language NOTE: in Sparql queries, ```lang``` is a predefined keyword and the Lucene query will be mapped with the right langField name. Moreover, TextIndexConfig class has been introduced and EntityDefinition refactored to simplify TextDatasetFactory as desired. LuceneUtil has moved as Util class in analyzer package. It still used with the LocalizedAnalyzerAssembler --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-103646019 Excellent work! Looks almost ready for merging. A couple of comments: 1.You say that ?s text:query (rdfs:label 'word' 'lang:none') will target unlocalized literals. But the code does this: ```java String qs2 = !none.equals(langArg)? field + : + langArg : - + field + :*; ``` In my understanding, this will do a wildcard search on the language field if the query argument is lang:none. So it wouldn't be possible to search for non-language-tagged literals. Unless I'm missing something, I'd replace the above expression with simply this line: ```java String qs2 = field + : + langArg; ``` 2.You have now removed some of the old constructors from EntityDefinition and the create* methods from TextDatasetFactory (and had to change the example code and some unit tests accordingly). While this leads to cleaner code, it also breaks the API so old code has to be changed. This might actually be a good moment to do it, as Jena has just bumped the version number to 3 and other APIs are also being changed, but I think we need opinions on that. If it turns out to be too disruptive, then compatibility methods could be added back. Could you write a message to the dev@ list and explain your contribution? I could then join in with comments in favor of merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-102297487 Thanks! I think decoupling these would be a good thing for other purposes too. For example, I have some plans to propose storing (optionally) the full literal values in the Lucene index, so that they can be used in SPARQL queries, and having the language tags in the index would help a bit with that. Still a couple more proposals: 1. Your new LuceneUtil class appears to only contain (static) methods related to the multilingual analyzer. I propose moving those methods to TextIndexLuceneMultilingual and removing the LuceneUtil class, if there is no other use for it. 2. The TextDatasetFactory methods for creating indexes are getting more and more convoluted as new parameters and variations are added. (I'm also guilty of this - when I added graphField support, I just made new versions of the methods with more parameters, but left in the old ones too for compatibility.) I think that it would make sense to introduce a new class TextIndexConfiguration which could be used to set only those parameters that are relevant to the use, something like this: ```java TextIndexConfiguration conf = new TextIndexConfiguration(entDef); conf.setAnalyzer(analyzer); conf.setGraphField(graph); conf.setLanguageField(lang); Dataset dsLucene = TextDatasetFactory.createLucene(dsBase, textdir, conf); ``` Using a multilingual index could perhaps be a boolean flag (since it's not just a standard analyzer), like this: ```java conf.setMultilingualAnalyzer(true); ``` This way, I think we could avoid blowing up the number of createLucene and createLuceneIndex methods further. There would only be one more variant of each method in TextDatasetFactory, taking a TextIndexConfiguration parameter. When new features are added (such as langField and multilingual indexing), these can be added to TextIndexConfiguration but there is no need for further createLucene-style methods. createLuceneIndexMultilingual could be dropped. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-102398852 Yes completely agree with that, otherwise TextDatasetFactory will be difficult to maintain. And no more need of TextIndexLuceneMultilingualAssembler also. TextIndexConfiguration should be about : analyzer, queryAnalyzer, multilingual,.. and graphField, langField concern EntityDefinition no ? Moreover, to avoid the same growing constructors, EntityDefinition should be configurable with setters/getters. Just a question about the usual worflow, who decides to integrate, or not, a proposed code ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-102406631 TextIndexConfiguration should be about : analyzer, queryAnalyzer, multilingual,.. and graphField, langField concern EntityDefinition no ? Moreover, to avoid the same growing constructors, EntityDefinition should be configurable with setters/getters. Oh yes, sorry, I was a bit confused about TextIndexFactory methods vs. EntityDefinition. Indeed the field-specific configuration belongs in EntityDefinition and I agree it would make sense to add setters there instead of yet more constructors. As it turns out the TextIndexConfiguration class I suggested will only be concerned with analyzer configuration, maybe it could be called TextIndexAnalyzerConfiguration or something like that. Just a question about the usual worflow, who decides to integrate, or not, a proposed code ? The final decision is made by the core maintainers, often Andy Seaborne (@afs). But I have tried to help along to make this ready (Andy actually asked me for help as he was busy with the Jena3 transition) and I think other people with an interest in jena-text could also give their opinion. I think this is beginning to look like ready for merging - assuming the changes discussed above will happen. When those are done I think it would make sense to ask for final comments on the dev@jena.apache.org mailing list, since few people have commented here on GitHub but I know there are people like myself on the list who make use of jena-text. The only controversial issue I can think of right now is the parameter format for the text:query property function - the lang:xx parameter looks a bit hacky to me, and while I can't suggest anything else, somebody else might. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-102413569 I prefer TextIndexConfiguration, it will be easier to add future conf parameters. Ok. About the lang:xx, I think that extra params should be generalized in the same manner, limit:10, score:x,... Hence it would allow params to be optional and would remove the order and size constraints. I see the point. I don't know if there's another way to provide optional named parameters to ARQ property functions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-102412893 As it turns out the TextIndexConfiguration class I suggested will only be concerned with analyzer configuration, maybe it could be called TextIndexAnalyzerConfiguration or something like that. I prefer TextIndexConfiguration, it will be easier to add future conf parameters. Thanks for the explanation. About the lang:xx, I think that extra params should be generalized in the same manner, limit:10, score:x,... Hence it would allow params to be optional and would remove the order and size constraints. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-102144875 Hi, I see that you already changed your code - impressive work! One more suggestion - I hope it won't come too late, since you've already moved code from TextIndexLucene to TextIndexLuceneMultilingual like I suggested... I've been thinking about how I could make use of this in my application (Skosmos). I don't have a need for language-specific analyzers (LowerCaseKeywordAnalyzer works better for the application, in all languages), but it could be useful to be able to target searches based on language - this way I could avoid some false hits from the text index and thus get faster queries overall. So I wonder if it would be possible to separate these two aspects: 1. Store language tags of literals in the Lucene index and be able to restrict the query to a specific language with a query parameter 2. Use different analyzers for different languages Right now your code does both, but it's not possible to do only 1. Obviously 2 depends on 1. How about adding a new option langField, similar to graphField, that can be configured via the assembler (or as a constructor parameter, just like graphField). When set to the name of a field (the obvious choice would be lang), the language tags would get stored in the index, and it would be possible to target queries for a specific language. This would already be base functionality for TextIndexLucene (sorry - I know you just moved the code away!). Then the MultilingualAnalyzer, implemented in TextIndexLuceneMultilingual, would depend on having langField set and would actually cause the analyzer to be dynamically selected based on language. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-102207391 Ok, it's not supposed to be a big job. I'll take a look soon. For the multilingual analyzer, the lang must be stored anyway, it depends on it (like you said in point 2). So, either the langField param will be ignored or an exception will be raised to alert the forgotten field ? Another point : In the current version, I put an undef value rather than an empty one for the unlocalized literals. Because the query on empty field is not obvious with Lucene and I want to be able to search unlocalized values in explicit way. In your case, I don't think it will be a necessity. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user amiara514 commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-101808418 Ok, I will separate it to have a distinct behavior. Removing the isMultilingual() method will probably force to switch some private fields and/or methods to protected. So you should never see a literal such as book@eng in well-formed RDF data. Yes unfortunaltely. It was to accept this case. Ok, if the strict implementation about standards is sufficient, I will bypass the 3 to 2 letters conversion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
Github user osma commented on the pull request: https://github.com/apache/jena/pull/64#issuecomment-101780700 This looks very promising. As I've said before, I think a single index is preferable to multiple indexes. It should require less book-keeping overall. I have to say I don't like the current division of labor between TextIndexLucene and TextIndexLuceneMultilingual, with the latter having an isMultilingual() method that returns true, and all the actual multilingual support in TextIndexLucene. I think that ideally, TextIndexLucene should not be in any way aware of the multilingual aspects and should merely provide hooks or extension/override points that can then be used by TextIndexLuceneMultilingual to provide multilingual functionality. As an example, instead of this (in TextIndexLucene around line 230): ```java Analyzer analyzer = null; if (isMultilingual()) analyzer = LuceneUtil.getLocalizedAnalyzer(entity.getLanguage()); if (analyzer != null) indexWriter.addDocument(doc, analyzer) ; else //use the default one indexWriter.addDocument(doc) ; ``` you could have something like this: ```java /* at the end of TextIndexLucene.addEntity() */ addDocument(entity); } /* extension point; can be overridden if necessary */ protected void addDocument(Entity entity) { Document doc = doc(entity); indexWriter.addDocument(doc); } /* in TextIndexLuceneMultilingual */ @Override protected void addDocument(Entity entity) { Document doc = doc(entity); Analyzer analyzer = LuceneUtil.getLocalizedAnalyzer(entity.getLanguage()); indexWriter.addDocument(doc, analyzer); } ``` I understand that compromises sometimes have to be made, but putting all the multilingual support code in the TextIndexLucene class is, I think, pushing it too far. It also makes it hard to extend TextIndexLucene for other purposes. Another thing: I don't quite understand the need to handle 3 letter language tags. AFAIU the relevant standards (RDF 1.0, XML 1.0, RDF 1.1) all specify that language tags should be formatted according to BCP 47 / RFC 4646 which, I think, states that when a 2 letter ISO 639-1 language tag exists, it should be used instead of the equivalent 3-letter ISO 639-2 or ISO 639-3 codes. So you should never see a literal such as book@eng in well-formed RDF data. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] jena pull request: Jena-text multilingual alternative implementati...
GitHub user amiara514 opened a pull request: https://github.com/apache/jena/pull/64 Jena-text multilingual alternative implementation This proposal is an alternative of [pull52](https://github.com/apache/jena/pull/52) (JENA-928). It deals with a single index that includes language as a new field entry. You can merge this pull request into a Git repository by running: $ git pull https://github.com/LICEF/jena jena-text-ml-single-index Alternatively you can review and apply these changes as the patch at: https://github.com/apache/jena/pull/64.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #64 commit 9553c6b2c246bc9c05906096c1f56d65ba15fed8 Author: Alexis Miara alexis_mi...@hotmail.com Date: 2015-05-13T15:23:56Z Implementation of jena-text multilingual with a single index --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---