Re: [basex-talk] multi-language full-text indexing
Hi,

I just want to say that for the dictionary I used BaseX for, having a multi-lingual full-text index would have been very nice. Barring that, a partial index based on certain rules the user supplies would also have been nice, for instance being able to distinguish between á, a, and ā in a word. In early Irish textual criticism, length marks are often added by text editors with a macron to denote a long vowel that has been identified by the editor but is not in the original text. Being able to say "build an index with á and a, but not ā" would be helpful.

As a first pass, I would suggest building the index by using xml:lang attributes to determine which stemmer to use, etc. If the document supplies them, you could use them to build the indexes differently.

All the best,
Chris

On Wed, Apr 22, 2015 at 11:35:48AM +0200, Goetz Heller wrote:

Here's another addendum: even if multi-language full-text indexing is not going to be implemented in the near future, it would still be a useful feature to be able to restrict full-text indexing to parts of a document, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/PART_A,
  (path_b)/PART_B, …
)

Kind regards,
Goetz

Original message - From: Christian Grün [mailto:christian.gr...@gmail.com] - Sent: Wednesday, 22 April 2015 11:03 - To: Goetz Heller - Cc: BaseX - Subject: Re: [basex-talk] multi-language full-text indexing

It is desirable to have documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: if you don't specify the language in the query, we'd need to decide dynamically which stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.
As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend that users create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery; there have been various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG, DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR LANGUAGE BG', for example. The index parts would be much smaller and full-text retrieval therefore much faster. The language codes would be mapped somehow to standard values recognized by BaseX in the language_code_map file. Are there any efforts towards such a feature?
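The "additional database as index" approach Christian recommends can be sketched in XQuery. The database name, element structure, use of xml:lang, and BaseX version supporting an options map in db:create are assumptions for illustration, not details from the thread:

```xquery
(: Sketch: copy the parts of one language into their own database and
   give that database its own full-text index options. The database
   name "texts" and the xml:lang markup are illustrative assumptions. :)
let $lang := 'de'
let $parts := db:open('texts')//*[@xml:lang = $lang]
return db:create(
  'texts-' || $lang,                       (: one database per language :)
  document { <parts>{ $parts }</parts> },  (: only the localized parts  :)
  'parts.xml',
  map { 'ftindex': true(), 'language': $lang, 'stemming': true() }
)
```

Full-text queries can then be directed at the language-specific database, whose index was built with the matching stemmer and language.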
Re: [basex-talk] multi-language full-text indexing
Chris,

Thanks for your feedback. Yes, I see that there is a lot of demand for a more customizable full-text index. Did you already try to build some additional index databases based on the rules you were listing here? It's not as comfortable as a tightly coupled full-text index, but the more use cases I get to hear of, the more I wonder if we could manage to satisfy everyone's needs at all.

Cheers,
Christian

On Wed, Apr 22, 2015 at 11:20 PM, Chris Yocum cyo...@gmail.com wrote:

Hi,

I just want to say that for the dictionary I used BaseX for, having a multi-lingual full-text index would have been very nice. Barring that, a partial index based on certain rules the user supplies would also have been nice, for instance being able to distinguish between á, a, and ā in a word. In early Irish textual criticism, length marks are often added by text editors with a macron to denote a long vowel that has been identified by the editor but is not in the original text. Being able to say "build an index with á and a, but not ā" would be helpful.

As a first pass, I would suggest building the index by using xml:lang attributes to determine which stemmer to use, etc. If the document supplies them, you could use them to build the indexes differently.

All the best,
Chris

On Wed, Apr 22, 2015 at 11:35:48AM +0200, Goetz Heller wrote:

Here's another addendum: even if multi-language full-text indexing is not going to be implemented in the near future, it would still be a useful feature to be able to restrict full-text indexing to parts of a document, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/PART_A,
  (path_b)/PART_B, …
)

Kind regards,
Goetz

Original message - From: Christian Grün [mailto:christian.gr...@gmail.com] - Sent: Wednesday, 22 April 2015 11:03 - To: Goetz Heller - Cc: BaseX - Subject: Re: [basex-talk] multi-language full-text indexing

It is desirable to have documents indexed by locale-specific parts, e.g.
I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: if you don't specify the language in the query, we'd need to decide dynamically which stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend that users create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery; there have been various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG, DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR LANGUAGE BG', for example. The index parts would be much smaller and full-text retrieval therefore much faster. The language codes would be mapped somehow to standard values recognized by BaseX in the language_code_map file. Are there any efforts towards such a feature?
Re: [basex-talk] Creation of Full-Text-Index failed
Thank you. I tried it out, and now everything works fine.

Kind regards,
Goetz

Original message - From: Christian Grün [mailto:christian.gr...@gmail.com] - Sent: Wednesday, 22 April 2015 13:22 - To: Goetz Heller - Cc: BaseX - Subject: Re: [basex-talk] Creation of Full-Text-Index failed

...enjoy the fixed version [1].

Christian

[1] http://files.basex.org/releases/latest

On Tue, Apr 21, 2015 at 8:56 PM, Goetz Heller hel...@hellerim.de wrote:

For the task at hand I need to create a database on a daily basis from file packages I receive. The language here is German, but the files contain lots of international characters as well. Usually this does no harm, and I don't know if it is the real cause of the failure in this case. Actually, the database was created, but an error message occurred which was not very specific: "file xxx could not be parsed". File "xxx" was the last file of the package, and it was accessible for XQuery search. However, no full-text index was created, unlike with the other packages. Trying to create the index directly resulted in a different message: "Improper use? … Stack Trace: java.lang.ArrayIndexOutOfBoundsException". The package can be downloaded from http://www.hellerim.de/downloads/BaseX/20150203_023.7z. This does not look like a problem with the data, but rather like a bug in BaseX. If I'm wrong, however, I would prefer to get a message which points me to the problem so I can try to solve it.

Kind regards,
Goetz
Re: [basex-talk] Distributing queries to several processors
Hi Götz,

it would make perfect sense to parallelize the query. Is there a way to achieve this using XQuery?

Our initial attempts to integrate low-level support for parallelization in XQuery turned out not to be as successful as we had hoped. One reason is that you can do basically everything with XQuery, and it's pretty hard to detect patterns in the code that are simple enough to be parallelized. Next to that, Java does not give us enough facilities to control CPU caching behavior.

As you already indicated, you can simply run multiple queries in parallel, e.g. using Java threads or the BaseX client/server architecture (which by default allows 8 parallel transactions [1]). If your queries do a lot of I/O, you will often get better performance by only allowing one transaction at a time, though. This is due to the random access patterns on your external drives (and in my experience, it also applies to SSDs). However, if you work with main-memory instances of databases, parallelization might give you some performance gains (albeit not as big as you might expect).

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options#PARALLEL
Re: [basex-talk] RESTXQ accept/produces issue
Hi Marc,

"If the %rest:produces annotation is specified, a function will only be invoked if the HTTP Accept header of the request matches one of the given types, or if it does not specify any HTTP Accept header at all." I asked Adam a while ago to get the online version of the spec updated, but I think he's pretty busy with other things right now. Maybe you could add this to the EXQuery tracker [1]?

Best,
Christian

[1] https://github.com/exquery/exquery

On Wed, Apr 22, 2015 at 9:32 AM, Marc van Grootel marc.van.groo...@gmail.com wrote:

Hi Christian,

You are right; foolish of me not to verify on the latest version, or even on 8.1, where this was fixed already. I was hitting an API that was part of our software, which still used an 8.0 version. I just verified it on 8.1 and the latest snapshot, and there it's fine.

One nitpick for the RESTXQ spec, though. Shouldn't the passage I quoted be modified to read something like: "If the %rest:produces annotation is specified, a function will only be invoked if the HTTP Accept header of the request matches one of the given types, or if it does not specify any HTTP Accept header at all."

Cheers,
--Marc

On Tue, Apr 21, 2015 at 4:53 PM, Christian Grün christian.gr...@gmail.com wrote:

Hi Marc,

I remember this issue has been discussed before (I just cannot find any online reference). I agree that the produces annotation should be ignored if no Accept header is given. Which version have you been using? Does it occur with the latest snapshot?

Thanks in advance,
Christian

On Tue, Apr 21, 2015 at 2:05 PM, Marc van Grootel marc.van.groo...@gmail.com wrote:

Hi,

I spent a couple of hours pulling my hair before I realized what was going on here. Question: what happens when I call a RESTXQ function which has a rest:produces('application/xml') annotation, but the request does not have an Accept header? This is what the HTTP 1.1 spec [1] says about that: "If no Accept header field is present, then it is assumed that the client accepts all media types."
"If an Accept header field is present, and if the server cannot send a response which is acceptable according to the combined Accept field value, then the server SHOULD send a 406 (not acceptable) response."

In fact, what does happen is that you get a 404, and this is caused by the rest:produces annotation. In a REST call you do not always set, or have the option to set, an appropriate Accept header (e.g. in HTTP client libraries, or when doing a doc('http://.') call from XSLT). I believe that when no Accept header is present, the response should assume that any media type is OK. Additionally, it would be nice for REST clients if, in case the path matches but content negotiation fails, a 406 were returned instead of a 404. The latter says the resource does not exist, whereas 406 expresses that the issue is with the media type but the resource may exist.

Quite possibly the text in the RESTXQ spec has to be modified as well in that case, because it currently reads (consistent with the current behaviour): "If the %rest:produces annotation is specified, a function will only be invoked if the HTTP Accept header of the request matches one of the given types."

Would it be possible to get this changed? Or is it maybe better to take this up in another forum?

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
[2] http://exquery.github.io/exquery/exquery-restxq-specification/restxq-1.0-specification.html#produces-annotation

--Marc
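The behavior under discussion can be reproduced with a minimal RESTXQ function; the module URI, path, and response body below are made up for illustration:

```xquery
(: Minimal RESTXQ function carrying a %rest:produces annotation. Under
   the clarified rule, it is invoked when the request's Accept header
   matches application/xml, or when no Accept header is sent at all.
   Namespace, path, and payload are illustrative. :)
module namespace page = 'http://example.org/page';

declare
  %rest:GET
  %rest:path('/hello')
  %rest:produces('application/xml')
function page:hello() {
  <response>hello</response>
};
```

A request with a non-matching header such as `Accept: text/html` is exactly where the 404-versus-406 question arises.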
Re: [basex-talk] IllegalMonitorStateException at org.basex.core.locks.DBLocking
Hi Simon,

I finally had time to look at your examples, and...

"One more detail: [...]"

...seemed to fix it! The original version of this class was written by Jens (in the cc), but I also believe that the basic problem was that the locks instance was not synchronized. In my fix, I used a ConcurrentHashMap instance and changed other minor things in the code (see [1], 8.2 branch). A new snapshot is available, too [2].

Thanks for the helpful feedback!
Christian

[1] https://github.com/BaseXdb/basex/commit/d3503d36325cb0fea58a13ee9f54feb5ce8868a6
[2] http://files.basex.org/releases/latest

If, in the unsetLockIfUnused() method of the DBLocking class, I put the locks.remove(object) call in a synchronized(locks){} block, the problem does not appear any more. But as I don't understand exactly what the problem is, I am not sure if it really solves it, or if it just changes the timing a little, making the problem less likely to happen.

Regards
Simon

Hello Christian,

After some testing on my side, I didn't see the ConcurrentModificationException any more. That's the good news. However, when running a slightly modified version of the small test case you wrote from my sample application, I faced another problem. The test can run for anywhere from a few seconds to almost an hour, but eventually the following exception is thrown.
java.lang.IllegalMonitorStateException
  at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryRelease(ReentrantReadWriteLock.java:374)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
  at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.unlock(ReentrantReadWriteLock.java:1131)
  at org.basex.core.locks.DBLocking.release(DBLocking.java:215)
  at org.basex.core.Context.unregister(Context.java:287)
  at org.basex.core.Command.execute(Command.java:105)
  at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
  at org.basex.api.client.Session.execute(Session.java:36)
  at basextest.BaseXTestDBLocking$1.run(BaseXTestDBLocking.java:43)

I attach my new test case (BaseXTestAdd.java); the main modification is that each created thread also adds documents to the collections instead of only opening them. I was also able to see that in the call to getOrCreateLock() in DBLocking#release(final Proc pr) (line 212) the lock is created while it should already be in the locks map, but I really cannot understand how this is possible. It would mean that the lock was removed by another thread, but for that the usage value must be wrong in the lockUsage map, and I cannot find any sequence of operations that could lead to such a situation.

Trying to pinpoint the problem more precisely, I wrote another test (BaseXTestDBLocking.java) that calls the acquire and release methods of the DBLocking class directly. The problem seems to happen more quickly with this test.

Any thoughts?

Regards
Simon
Re: [basex-talk] Distributing queries to several processors
Hi Erol,

I am not volunteering :-) but if somebody wants to take this route, this code might give some pointers [1]. It uses Apache Spark to run Saxon-HE; see also an XQuery example [2] and more info [3].

/Andy

[1] https://github.com/elsevierlabs/spark-xml-utils
[2] https://github.com/elsevierlabs/spark-xml-utils/wiki/xquery
[3] http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1407936616.34624.yahoomail...@web141003.mail.bf1.yahoo.com%3E

On 22 April 2015 at 10:05, Erol Akarsu eaka...@gmail.com wrote:

Christian,

I think we should be able to attach BaseX to Apache Spark, but the integration code needs to be written. Everybody is able to read from Hadoop, Solr, Elasticsearch, etc. into Spark and process the data there. Why not from BaseX?

Erol Akarsu

On Wed, Apr 22, 2015 at 4:28 AM, Christian Grün christian.gr...@gmail.com wrote:

Hi Götz,

it would make perfect sense to parallelize the query. Is there a way to achieve this using XQuery?

Our initial attempts to integrate low-level support for parallelization in XQuery turned out not to be as successful as we had hoped. One reason is that you can do basically everything with XQuery, and it's pretty hard to detect patterns in the code that are simple enough to be parallelized. Next to that, Java does not give us enough facilities to control CPU caching behavior.

As you already indicated, you can simply run multiple queries in parallel, e.g. using Java threads or the BaseX client/server architecture (which by default allows 8 parallel transactions [1]). If your queries do a lot of I/O, you will often get better performance by only allowing one transaction at a time, though. This is due to the random access patterns on your external drives (and in my experience, it also applies to SSDs). However, if you work with main-memory instances of databases, parallelization might give you some performance gains (albeit not as big as you might expect).
Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options#PARALLEL
[basex-talk] multi-language full-text indexing
Here's another addendum: even if multi-language full-text indexing is not going to be implemented in the near future, it would still be a useful feature to be able to restrict full-text indexing to parts of a document, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/PART_A,
  (path_b)/PART_B, …
)

Kind regards,
Goetz

Original message - From: Christian Grün [mailto:christian.gr...@gmail.com] - Sent: Wednesday, 22 April 2015 11:03 - To: Goetz Heller - Cc: BaseX - Subject: Re: [basex-talk] multi-language full-text indexing

It is desirable to have documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: if you don't specify the language in the query, we'd need to decide dynamically which stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend that users create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery; there have been various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG, DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR LANGUAGE BG', for example.
The index parts would be much smaller and full-text retrieval therefore much faster. The language codes would be mapped somehow to standard values recognized by BaseX in the language_code_map file. Are there any efforts towards such a feature?
Re: [basex-talk] multi-language full-text indexing
Reminds me of an old GitHub issue. I have added a link to your request: https://github.com/BaseXdb/basex/issues/59.

On Wed, Apr 22, 2015 at 11:35 AM, Goetz Heller hel...@hellerim.de wrote:

Here's another addendum: even if multi-language full-text indexing is not going to be implemented in the near future, it would still be a useful feature to be able to restrict full-text indexing to parts of a document, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/PART_A,
  (path_b)/PART_B, …
)

Kind regards,
Goetz

Original message - From: Christian Grün [mailto:christian.gr...@gmail.com] - Sent: Wednesday, 22 April 2015 11:03 - To: Goetz Heller - Cc: BaseX - Subject: Re: [basex-talk] multi-language full-text indexing

It is desirable to have documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: if you don't specify the language in the query, we'd need to decide dynamically which stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend that users create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery; there have been various examples for this on this mailing list (see e.g. [1,2]).
Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG, DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR LANGUAGE BG', for example. The index parts would be much smaller and full-text retrieval therefore much faster. The language codes would be mapped somehow to standard values recognized by BaseX in the language_code_map file. Are there any efforts towards such a feature?
Re: [basex-talk] multi-language full-text indexing
Thanks, Fabrice! I'll work it out.

Kind regards,
Goetz

From: Fabrice Etanchaud [mailto:fetanch...@questel.com] - Sent: Wednesday, 22 April 2015 11:32 - To: Goetz Heller; basex-talk@mailman.uni-konstanz.de - Subject: RE: [basex-talk] multi-language full-text indexing

Great, Goetz!

One last thing: if you need to rebuild the original document from its parts, be sure to have a way to retrieve them all (by document path, attribute index, or a separate index collection with node-id/pre values). If disk space is not an issue, you could store the original document as it is, and create localized collections for full-text indexing purposes.

Hoping it helps,
Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz Heller - Sent: Wednesday, 22 April 2015 11:20 - To: basex-talk@mailman.uni-konstanz.de - Subject: Re: [basex-talk] multi-language full-text indexing

Fabrice,

For the time being, this sounds quite nice. I'd have to split up the files into some common part and a set of satellites, one satellite for each language present in the document. Thanks!

Kind regards,
Goetz

From: Fabrice Etanchaud [mailto:fetanch...@questel.com] - Sent: Wednesday, 22 April 2015 11:04 - To: Goetz Heller; basex-talk@mailman.uni-konstanz.de - Subject: RE: [basex-talk] multi-language full-text indexing

Dear Goetz,

I have the same requirement (patent documents containing text in different languages). I ended up splitting/filtering each original document into localized parts inserted into different collections, each collection having its own full-text index configuration. BaseX is as flexible as our data!
Best regards,

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz Heller - Sent: Wednesday, 22 April 2015 10:50 - To: basex-talk@mailman.uni-konstanz.de - Subject: [basex-talk] multi-language full-text indexing

I'm working with documents destined to be consumed anywhere in the European Community. Many of them have the same tags multiple times, but with a different language attribute. It therefore does not make sense to create a full-text index for the whole of these documents. It would be desirable to have documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG, DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR LANGUAGE BG', for example. The index parts would be much smaller and full-text retrieval therefore much faster. The language codes would be mapped somehow to standard values recognized by BaseX in the language_code_map file. Are there any efforts towards such a feature?
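Fabrice's splitting approach can be sketched roughly as follows. The database names and the assumption that localized parts carry a LANG attribute are illustrative, not from the thread:

```xquery
(: Sketch: distribute the localized parts of each document into
   per-language collections, so that each collection can be given its
   own full-text index configuration. All names are illustrative. :)
for $doc in db:open('docs')
for $part in $doc//*[@LANG]
return db:add(
  'docs-' || lower-case($part/@LANG),  (: e.g. docs-de, docs-en :)
  document { $part },
  db:path($doc)                        (: keep the original path :)
)
```

Keeping the original document path on each part is one way to satisfy the reassembly concern raised in this thread: all parts of a document can later be retrieved by that shared path.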
Re: [basex-talk] Creation of Full-Text-Index failed
...enjoy the fixed version [1].

Christian

[1] http://files.basex.org/releases/latest

On Tue, Apr 21, 2015 at 8:56 PM, Goetz Heller hel...@hellerim.de wrote:

For the task at hand I need to create a database on a daily basis from file packages I receive. The language here is German, but the files contain lots of international characters as well. Usually this does no harm, and I don't know if it is the real cause of the failure in this case. Actually, the database was created, but an error message occurred which was not very specific: "file xxx could not be parsed". File "xxx" was the last file of the package, and it was accessible for XQuery search. However, no full-text index was created, unlike with the other packages. Trying to create the index directly resulted in a different message: "Improper use? … Stack Trace: java.lang.ArrayIndexOutOfBoundsException". The package can be downloaded from http://www.hellerim.de/downloads/BaseX/20150203_023.7z. This does not look like a problem with the data, but rather like a bug in BaseX. If I'm wrong, however, I would prefer to get a message which points me to the problem so I can try to solve it.

Kind regards,
Goetz
[basex-talk] multi-language full-text indexing
I'm working with documents destined to be consumed anywhere in the European Community. Many of them have the same tags multiple times, but with a different language attribute. It therefore does not make sense to create a full-text index for the whole of these documents. It would be desirable to have documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG, DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR LANGUAGE BG', for example. The index parts would be much smaller and full-text retrieval therefore much faster. The language codes would be mapped somehow to standard values recognized by BaseX in the language_code_map file. Are there any efforts towards such a feature?
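Although the proposed CREATE FULL-TEXT INDEX syntax does not exist, XQuery Full Text already allows per-query language, stemming, and stop-word options. The database, element, and attribute names below mirror the sketch above and are illustrative:

```xquery
(: Language-aware matching with standard XQuery Full Text options.
   Database, element, and attribute names are illustrative. :)
for $p in db:open('docs')//LOCALIZED_PART_A[@LANG = 'DE']
where $p contains text 'Häuser'
  using language 'de'
  using stemming
return $p
```

Such a query typically only profits from the full-text index when its options match those the index was built with, which is precisely the gap a language-partitioned index would close.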
Re: [basex-talk] IllegalMonitorStateException at org.basex.core.locks.DBLocking
Hello,

Excellent, glad to be of use. I'll try the new snapshot right away.

Cheers
Simon

On Wed, Apr 22, 2015 at 10:06 AM, Christian Grün christian.gr...@gmail.com wrote:

Hi Simon,

I finally had time to look at your examples, and...

"One more detail: [...]"

...seemed to fix it! The original version of this class was written by Jens (in the cc), but I also believe that the basic problem was that the locks instance was not synchronized. In my fix, I used a ConcurrentHashMap instance and changed other minor things in the code (see [1], 8.2 branch). A new snapshot is available, too [2].

Thanks for the helpful feedback!
Christian

[1] https://github.com/BaseXdb/basex/commit/d3503d36325cb0fea58a13ee9f54feb5ce8868a6
[2] http://files.basex.org/releases/latest

If, in the unsetLockIfUnused() method of the DBLocking class, I put the locks.remove(object) call in a synchronized(locks){} block, the problem does not appear any more. But as I don't understand exactly what the problem is, I am not sure if it really solves it, or if it just changes the timing a little, making the problem less likely to happen.

Regards
Simon

Hello Christian,

After some testing on my side, I didn't see the ConcurrentModificationException any more. That's the good news. However, when running a slightly modified version of the small test case you wrote from my sample application, I faced another problem. The test can run for anywhere from a few seconds to almost an hour, but eventually the following exception is thrown.
java.lang.IllegalMonitorStateException
  at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryRelease(ReentrantReadWriteLock.java:374)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
  at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.unlock(ReentrantReadWriteLock.java:1131)
  at org.basex.core.locks.DBLocking.release(DBLocking.java:215)
  at org.basex.core.Context.unregister(Context.java:287)
  at org.basex.core.Command.execute(Command.java:105)
  at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
  at org.basex.api.client.Session.execute(Session.java:36)
  at basextest.BaseXTestDBLocking$1.run(BaseXTestDBLocking.java:43)

I attach my new test case (BaseXTestAdd.java); the main modification is that each created thread also adds documents to the collections instead of only opening them. I was also able to see that in the call to getOrCreateLock() in DBLocking#release(final Proc pr) (line 212) the lock is created while it should already be in the locks map, but I really cannot understand how this is possible. It would mean that the lock was removed by another thread, but for that the usage value must be wrong in the lockUsage map, and I cannot find any sequence of operations that could lead to such a situation.

Trying to pinpoint the problem more precisely, I wrote another test (BaseXTestDBLocking.java) that calls the acquire and release methods of the DBLocking class directly. The problem seems to happen more quickly with this test.

Any thoughts?

Regards
Simon
Re: [basex-talk] multi-language full-text indexing
Dear Goetz,

I have the same requirement (patent documents containing text in different languages). I ended up splitting/filtering each original document into localized parts inserted into different collections, each collection having its own full-text index configuration. BaseX is as flexible as our data!

Best regards,

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz Heller - Sent: Wednesday, 22 April 2015 10:50 - To: basex-talk@mailman.uni-konstanz.de - Subject: [basex-talk] multi-language full-text indexing

I'm working with documents destined to be consumed anywhere in the European Community. Many of them have the same tags multiple times, but with a different language attribute. It therefore does not make sense to create a full-text index for the whole of these documents. It would be desirable to have documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG, DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR LANGUAGE BG', for example. The index parts would be much smaller and full-text retrieval therefore much faster. The language codes would be mapped somehow to standard values recognized by BaseX in the language_code_map file. Are there any efforts towards such a feature?
Re: [basex-talk] multi-language full-text indexing
It is desirable to have documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: if you don't specify the language in the query, we'd need to dynamically decide which stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend users to create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery. I remember there have been various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html
Re: [basex-talk] multi-language full-text indexing
In a nutshell: it would take some more time to explain all the implications. We know that there are various non-trivial issues to be solved, as we already thought about adding such an index some years ago.

Cheers,
Christian

On Wed, Apr 22, 2015 at 11:15 AM, Goetz Heller hel...@hellerim.de wrote:

The case you described should be made a non-issue: if a multi-language full-text index was created, then it was surely intended to execute searches within the confines of a specific language. Hence, if none was specified in the query, a runtime error should be thrown in such cases.

Kind regards,
Goetz

-Original Message-
From: Christian Grün [mailto:christian.gr...@gmail.com]
Sent: Wednesday, April 22, 2015 11:03
To: Goetz Heller
Cc: BaseX
Subject: Re: [basex-talk] multi-language full-text indexing

It is desirable to have documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: if you don't specify the language in the query, we'd need to dynamically decide which stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend users to create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery. I remember there have been various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html
Re: [basex-talk] multi-language full-text indexing
Great, Goetz!

One last thing: if you need to rebuild the original document from its parts, be sure to have a way to retrieve them all (by document path, attribute index, or a separate index collection with node-id/pre values). If disk space is not an issue, you could store the original document as it is and create the localized collections for full-text indexing purposes only.

Hoping it helps,
Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz Heller
Sent: Wednesday, April 22, 2015 11:20
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] multi-language full-text indexing

Fabrice,

For the time being, this sounds quite nice. I'd have to split up the files into a common part and a set of satellites, one satellite for each language present in the document. Thanks!

Kind regards,
Goetz
Re: [basex-talk] Distributing queries on several processors
Any volunteers out there? ;)

On Wed, Apr 22, 2015 at 11:05 AM, Erol Akarsu eaka...@gmail.com wrote:

Christian, I think we should be able to attach BaseX to Apache Spark, but the integration code needs to be written. Everybody is able to read from Hadoop, SOLR, ElasticSearch etc. into Spark and process data there. Why not for BaseX?

Erol Akarsu

On Wed, Apr 22, 2015 at 4:28 AM, Christian Grün christian.gr...@gmail.com wrote:

Hi Götz,

it would make perfect sense to parallelize the query. Is there a way to achieve this using XQuery?

Our initial attempts to integrate low-level support for parallelization in XQuery turned out not to be as successful as we hoped they would be. One reason is that you can basically do everything with XQuery, and it's pretty hard to detect patterns in the code that are simple enough to be parallelized. Next to that, Java does not give us enough facilities to control CPU caching behavior.

As you already indicated, you can simply run multiple queries in parallel, e.g. by using Java threads or the BaseX client/server architecture (which by default allows 8 transactions in parallel [1]). If your queries do a lot of I/O, you will often get better performance by allowing only one transaction at a time, though; this is due to the random access patterns on your external drives (and, in my experience, it also applies to SSDs). However, if you work with main-memory instances of databases, parallelization might give you some performance gains (albeit not as big as you might expect).

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options#PARALLEL
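The client-side parallelism Christian recommends can be sketched with plain JDK threading: submit one task per query to a thread pool, one session per task. The runQuery stand-in below is a placeholder; with BaseX, each task would open its own session (e.g. a ClientSession against the server, which by default admits 8 parallel transactions) and execute the query there.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelQueries {
    // Placeholder for one client running one query; with BaseX this would
    // open a per-thread session and call execute() on it.
    static String runQuery(String query) {
        return "result of " + query;
    }

    public static void main(String[] args) throws Exception {
        List<String> queries = List.of("q1", "q2", "q3", "q4");

        // One task per query; pool size mirrors the desired degree of
        // parallelism (BaseX caps server-side transactions separately).
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> futures = new ArrayList<>();
        for (String q : queries) {
            futures.add(pool.submit(() -> runQuery(q)));
        }

        // Collect results in submission order.
        for (Future<String> f : futures) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

As noted in the reply above, this only pays off when the queries are not I/O-bound: for disk-heavy workloads, competing random access patterns can make one transaction at a time faster than many in parallel.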
[basex-talk] multi-language full-text indexing
The case you described should be made a non-issue: if a multi-language full-text index was created, then it was surely intended to execute searches within the confines of a specific language. Hence, if none was specified in the query, a runtime error should be thrown in such cases.

Kind regards,
Goetz

-Original Message-
From: Christian Grün [mailto:christian.gr...@gmail.com]
Sent: Wednesday, April 22, 2015 11:03
To: Goetz Heller
Cc: BaseX
Subject: Re: [basex-talk] multi-language full-text indexing

It is desirable to have documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: if you don't specify the language in the query, we'd need to dynamically decide which stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend users to create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery. I remember there have been various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html
Re: [basex-talk] Distributing queries on several processors
Hi Götz (cc @ basex-talk),

OK, I think I understand.

However, I think there should be some possibilities to allow the user to give hints. In my opinion, FOR loops would be first-class candidates to use parallel streams, in particular in the use case I described in my previous posting: FOR $var IN (collection) PARALLEL RETURN (expression-list)

Makes sense, in general. XQuery pragmas could be a solution:

(# basex: parallel #) { ... }

Higher-order functions could provide functions like hof:parallel-map(...). However, this would have many effects on the architecture of BaseX in terms of performance, because we'd need to create new contexts for each parallelized query, which takes additional time. See the following query as an example:

$x[. = 123]

The dot refers to the current context item. If we parallelize a query, we'd have multiple current context items. The same multiplication would apply to the stack frame and other runtime variables, and the time lost duplicating these instances is in most cases more expensive than doing the work in a single thread. At least that's our experience so far. Once again, we are happy to see people jump into our code and show us that it can be done better.

Christian
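The "FOR ... PARALLEL" construct requested above has a close analogue in Java's parallel streams, which also illustrates Christian's context-item caveat: each parallel task needs its own copy of the per-item state. A minimal sketch (not BaseX code):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ParallelMap {
    public static void main(String[] args) {
        List<Integer> items = List.of(1, 2, 3, 4, 5);

        // Rough analogue of "FOR $var IN (collection) PARALLEL RETURN expr"
        // or a hypothetical hof:parallel-map: each element is mapped
        // independently, possibly on different threads. The lambda parameter
        // plays the role of XQuery's context item -- every parallel task
        // carries its own copy, which is exactly the per-context duplication
        // cost described above.
        List<Integer> squares = items.parallelStream()
                                     .map(n -> n * n)
                                     .collect(Collectors.toList());

        // Ordered source + collect() preserves encounter order.
        System.out.println(squares); // [1, 4, 9, 16, 25]
    }
}
```

This also shows why the gains are workload-dependent: for a cheap body like n * n, the per-task overhead (here: fork/join scheduling; in BaseX: fresh context items, stack frames, and runtime variables) easily exceeds the work saved.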