Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Chris Yocum
Hi,

I just want to say that for the dictionary that I used BaseX for,
having a multilingual full-text index would have been very nice.
Failing that, a partial index based on rules the user supplies would
also have been nice: for instance, being able to distinguish between
á, a, and ā in a word.  In early Irish textual criticism, text editors
often add a macron as a length mark to denote a long vowel that the
editor has identified but that is not in the original text.  Being
able to say "build an index with á and a but not ā" would be helpful.
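
One possible reading of this rule is that editorial macrons are folded away before indexing while acute accents are kept. The sketch below illustrates that reading with the standard java.text.Normalizer class; the class name MacronFolding and the method foldMacrons are hypothetical and are not part of BaseX.

```java
import java.text.Normalizer;

class MacronFolding {
  // U+0304 is the combining macron that appears after canonical decomposition.
  private static final String COMBINING_MACRON = "\u0304";

  /** Removes macrons from a token and leaves all other diacritics untouched. */
  static String foldMacrons(String token) {
    // Decompose so that a-macron becomes "a" followed by the combining macron.
    String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
    String stripped = decomposed.replace(COMBINING_MACRON, "");
    // Recompose so that a-acute is a single code point again.
    return Normalizer.normalize(stripped, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    System.out.println(foldMacrons("m\u0101thair")); // macron is dropped
    System.out.println(foldMacrons("m\u00e1thair")); // acute accent is kept
  }
}
```

Tokens normalized this way could be written to a separate index database, so the original text with its editorial marks stays untouched.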

As a first pass, I would suggest building the index by using xml:lang
attributes to determine which stemmer to use, etc.  If the document
supplies them, you could use them to build the indices differently
for each language.
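
A rough sketch of that first pass, assuming the documents carry xml:lang attributes: it walks a DOM tree with the JDK's built-in parser, resolves the inherited xml:lang for every text node, and groups the text by language, so each group could then be handed to a language-specific stemmer. The class LangGrouping and its method are hypothetical names; nothing here is BaseX API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class LangGrouping {
  /** Returns the text of a document grouped by effective xml:lang value. */
  static Map<String, String> textByLanguage(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // needed to resolve the xml: prefix
      Document doc = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Map<String, String> out = new TreeMap<>();
      collect(doc.getDocumentElement(), "", out);
      return out;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse input", e);
    }
  }

  private static void collect(Element e, String inherited, Map<String, String> out) {
    // xml:lang is inherited by descendants unless an element overrides it.
    String lang = e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")
        ? e.getAttributeNS(XMLConstants.XML_NS_URI, "lang") : inherited;
    for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
      if (n.getNodeType() == Node.TEXT_NODE) {
        String text = n.getTextContent().trim();
        if (!text.isEmpty()) out.merge(lang, text, (a, b) -> a + " " + b);
      } else if (n.getNodeType() == Node.ELEMENT_NODE) {
        collect((Element) n, lang, out);
      }
    }
  }

  public static void main(String[] args) {
    String xml = "<e xml:lang=\"ga\"><w>fer</w><gloss xml:lang=\"en\">man</gloss></e>";
    System.out.println(textByLanguage(xml));
  }
}
```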

All the best,
Chris

On Wed, Apr 22, 2015 at 11:35:48AM +0200, Goetz Heller wrote:
 Here's another addendum: even if multi-language full-text indexing is not
 going to be implemented in the near future, it would still be a useful
 feature to be able to restrict full-text indexing to parts of a document,
 e.g.

 CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
   (path_a)/PART_A,
   (path_b)/PART_B, …
 )
 
 Kind regards,
 
 Goetz
 
 -Original Message-
 From: Christian Grün [mailto:christian.gr...@gmail.com]
 Sent: Wednesday, 22 April 2015 11:03
 To: Goetz Heller
 Cc: BaseX
 Subject: Re: [basex-talk] multi-language full-text indexing
 
  It is desirable to have
  documents indexed by locale-specific parts, e.g.
 
 I can see that this would absolutely make sense, but it would take quite
 some effort to implement. There are also various conceptual issues related
 to XQuery Full Text: if you don't specify the language in the query, we'd
 need to decide dynamically which stemmers to use for the query strings,
 depending on the nodes that are currently targeted.
 This would pretty much blow up the existing architecture.

 As there are so many other types of index structures that could be helpful,
 depending on the particular use case, we usually recommend that users create
 additional BaseX databases, which can then serve as indexes. This can all be
 done in XQuery. I remember there have been various examples of this on this
 mailing list (see e.g. [1,2]).
 
 Hope this helps,
 Christian
 
 [1] 
 https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
 [2] 
 https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html
 
 
 
 
 
 
 
  CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
    (path_a)/LOCALIZED_PART_A[@LANG=$lang],
    (path_b)/LOCALIZED_PART_B[@LG=$lang], …
  ) FOR LANGUAGE $lang IN (
    BG,
    DN,
    DE WITH STOPWORDS filepath_de WITH STEM = YES,
    EN WITH STOPWORDS filepath_en,
    FR, …
  ) [USING language_code_map]
 
  and then to write full-text retrieval queries with a clause such as 
  ‘FOR LANGUAGE BG’, for example. The index parts would be much smaller 
  and full-text retrieval therefore much faster. The language codes 
  would be mapped somehow to standard values recognized by BaseX in the 
  language_code_map file.
 
  Are there any efforts towards such a feature?
 


Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Christian Grün
Chris,

Thanks for your feedback.

Yes, I see that there is a lot of demand for a more customizable
full-text index. Did you already try to build some additional index
databases, based on the rules you listed here? It's not as
comfortable as a tightly coupled full-text index, but the more use
cases I hear about, the more I wonder whether we could manage to
satisfy everyone's needs at all.

Cheers,
Christian


 On Wed, Apr 22, 2015 at 11:20 PM, Chris Yocum cyo...@gmail.com wrote:
 [...]


Re: [basex-talk] Creation of Full-Text-Index failed

2015-04-22 Thread Goetz Heller
Thank you. I tried it out, and now everything works fine.

Kind regards,

Goetz

-Original Message-
From: Christian Grün [mailto:christian.gr...@gmail.com]
Sent: Wednesday, 22 April 2015 13:22
To: Goetz Heller
Cc: BaseX
Subject: Re: [basex-talk] Creation of Full-Text-Index failed

...enjoy the fixed version [1].
Christian

[1] http://files.basex.org/releases/latest


On Tue, Apr 21, 2015 at 8:56 PM, Goetz Heller hel...@hellerim.de wrote:
 For the task at hand I need to create a database on a daily basis from
 file packages I receive. The language used here is German; however,
 the files contain lots of international characters as well. Usually
 this does no harm, and I don’t know if this is the real cause of the
 failure in this case.

 The database was in fact created, but an error message occurred
 which was not very specific: “file xxx could not be parsed”. File
 “xxx” was the last file of the package, and it was accessible for
 XQuery search. However, no full-text index was created as with the
 other packages. Trying to create the index directly resulted in a
 different message: “Improper use? … Stack
 Trace: java.lang.ArrayIndexOutOfBoundsException”.



 The package can be downloaded from
 http://www.hellerim.de/downloads/BaseX/20150203_023.7z.



 This does not look like a problem with the data but rather like a bug 
 in BaseX. If I’m wrong, however, I would prefer to get a message which 
 points me to the problem so I can try to solve it.



 Kind regards,



 Goetz









Re: [basex-talk] Distributing queries to several on several processors

2015-04-22 Thread Christian Grün
Hi Götz,

 it would
 make perfect sense to parallelize the query. Is there a way to achieve this
 using xQuery?

Our initial attempts to integrate low-level support for
parallelization in XQuery turned out not to be as successful as we
hoped they would be. One reason for that is that you can basically do
everything with XQuery, and it's pretty hard to detect patterns in the
code that are simple enough to be parallelized. Besides that, Java
does not give us enough facilities to control CPU caching behavior.

As you already indicated, you can simply run multiple queries in
parallel by e.g. using Java threads or the BaseX client/server
architecture (which by default allows 8 transactions in parallel [1]).
If your queries do a lot of I/O, you will often get better performance
by only allowing one transaction at a time, though. This is due to the
random access patterns on your external drives (and in my experience,
it also applies to SSDs). However, if you work with main-memory
instances of databases, parallelization might give you some
performance gains (albeit not as big as you might expect).
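
A minimal sketch of the "Java threads" option with an adjustable degree of parallelism. The Callable tasks only stand in for real queries, no BaseX API is used, and BoundedQueries and runAll are hypothetical names. With parallel set to 1 the tasks run one at a time, which matches the advice above for I/O-heavy workloads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedQueries {
  /** Runs all tasks with at most 'parallel' of them active at the same time. */
  static <T> List<T> runAll(List<Callable<T>> queries, int parallel) {
    ExecutorService pool = Executors.newFixedThreadPool(parallel);
    try {
      List<T> results = new ArrayList<>();
      // invokeAll blocks until every task is done and keeps the input order.
      for (Future<T> future : pool.invokeAll(queries)) results.add(future.get());
      return results;
    } catch (Exception e) { // InterruptedException or ExecutionException
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<Callable<Integer>> queries = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      final int n = i;
      queries.add(() -> n * n); // placeholder for an actual query
    }
    System.out.println(runAll(queries, 1)); // sequential
    System.out.println(runAll(queries, 8)); // up to 8 at a time
  }
}
```

Timing both settings against the real workload is probably the only reliable way to see whether parallel execution pays off.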

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options#PARALLEL


Re: [basex-talk] RESTXQ accept/produces issue

2015-04-22 Thread Christian Grün
Hi Marc,

 If the %rest:produces annotation is specified, a function will
 only be invoked if the HTTP Accept header of the request matches one
 of the given types, or if it does not specify any HTTP Accept header at all.

I asked Adam a while ago to get the online version of the spec
updated, but I think he's pretty busy with other things right now.
Maybe you could add this to the EXQuery tracker [1]?

Best,
Christian

[1] https://github.com/exquery/exquery



On Wed, Apr 22, 2015 at 9:32 AM, Marc van Grootel
marc.van.groo...@gmail.com wrote:
 Hi Christian,

 You are right, foolish of me not to verify on latest or even on 8.1,
 where this was fixed already. I was hitting an API that was part of our
 software, which still used an 8.0 version.

 Just verified it on 8.1 and latest snapshot and there it's fine.

 One nitpick for the RESTXQ spec though. Shouldn't the passage I quoted
 be modified to read something like:

 If the %rest:produces annotation is specified, a function will
 only be invoked if the HTTP Accept header of the request matches one
 of the given types, or if it does not specify any HTTP Accept header at all.

 Cheers,
 --Marc


 On Tue, Apr 21, 2015 at 4:53 PM, Christian Grün
 christian.gr...@gmail.com wrote:
 Hi Marc,

 I remember this issue has been discussed before (I just cannot find
 any online reference).

 I agree that the produces annotation should be ignored if no Accept
 header is given.. Which version have you been using? Does it occur
 with the latest snapshot?

 Thanks in advance,
 Christian


 On Tue, Apr 21, 2015 at 2:05 PM, Marc van Grootel
 marc.van.groo...@gmail.com wrote:
 Hi,

 I spent a couple of hours pulling my hair out before I realized what was
 going on here.

 Question: what happens when I call a RESTXQ function which has a
 rest:produces('application/xml') annotation but the request does not
 have an Accept header?

 This is what HTTP 1.1 spec[1] says about that:

 If no Accept header field is present, then it is assumed that the
 client accepts all media types. If an Accept header field is present,
 and if the server cannot send a response which is acceptable according
 to the combined Accept field value, then the server SHOULD send a 406
 (not acceptable) response.

 In fact, what does happen is that you get a 404, and this is caused by
 the rest:produces annotation. In a REST call you do not always set, or
 have the option to set, an appropriate Accept header (e.g. with HTTP client
 libraries or when doing a doc('http://.') call from XSLT).

 I believe that when no Accept header is present, the server should
 assume that any media type is OK. Additionally, it would be nice for
 REST clients if a 406 were returned instead of a 404 when the path
 matches but the content negotiation fails. A 404 says that the
 resource does not exist, whereas a 406 expresses that the issue is with
 the media type but the resource may exist.
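
The behaviour proposed here can be sketched as a small decision function. This is only an illustration of the proposal, not the RESTXQ implementation: ProducesNegotiation and its methods are hypothetical, and the matching is simplified (q-values and media type parameters are ignored).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ProducesNegotiation {
  /** Returns the HTTP status the proposal would give for one request. */
  static int status(boolean pathMatches, List<String> produces, String acceptHeader) {
    if (!pathMatches) return 404; // no such resource
    // No %rest:produces annotation or no Accept header: anything is acceptable.
    if (produces.isEmpty() || acceptHeader == null || acceptHeader.trim().isEmpty()) {
      return 200;
    }
    for (String range : acceptHeader.split(",")) {
      String type = range.split(";")[0].trim(); // drop parameters such as q=0.9
      for (String produced : produces) {
        if (matches(type, produced)) return 200;
      }
    }
    return 406; // resource exists, but not in an acceptable media type
  }

  /** Matches a media range such as "text/*" against a concrete media type. */
  static boolean matches(String range, String type) {
    String r = range.toLowerCase(Locale.ROOT);
    String t = type.toLowerCase(Locale.ROOT);
    if (r.equals("*/*")) return true;
    if (r.endsWith("/*")) return t.startsWith(r.substring(0, r.length() - 1));
    return r.equals(t);
  }

  public static void main(String[] args) {
    List<String> produces = Arrays.asList("application/xml");
    System.out.println(status(true, produces, null));
    System.out.println(status(true, produces, "application/json"));
    System.out.println(status(false, produces, null));
  }
}
```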

 Quite possibly the text in the RESTXQ spec [2] has to be modified as well
 in that case, because it currently reads (consistent with current
 behaviour):

 If the %rest:produces annotation is specified, a function will
 only be invoked if the HTTP Accept header of the request matches one
 of the given types.

 Would it be possible to get this changed? Or is it maybe better to
 take this up in another forum?

 [1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
 [2] 
 http://exquery.github.io/exquery/exquery-restxq-specification/restxq-1.0-specification.html#produces-annotation

 --Marc



 --
 --Marc


Re: [basex-talk] IllegalMonitorStateException at org.basex.core.locks.DBLocking

2015-04-22 Thread Christian Grün
Hi Simon,

I finally had time to look at your examples, and...

 One more detail: [...]

...seemed to fix it! The original version of this class was written by
Jens (in the cc), but I also believe that the basic problem was that
the locks instance was not synchronized. In my fix, I used a
ConcurrentHashMap instance and changed other minor things in the code
(see [1], 8.2 branch).
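
To illustrate the kind of race discussed in this thread, here is a minimal sketch of a reference-counted lock registry built on ConcurrentHashMap. It is not the code from the commit: LockRegistry, Entry and the method names are hypothetical. The idea is that the usage counter is only changed inside compute(), which runs atomically per key, so a lock cannot be removed while another thread is about to use it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockRegistry {
  private static final class Entry {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    int users; // only touched inside compute(), hence per-key atomic
  }

  private final ConcurrentHashMap<String, Entry> locks = new ConcurrentHashMap<>();

  void acquireWrite(String name) {
    // Register interest first, so the entry cannot be removed underneath us.
    Entry entry = locks.compute(name, (k, v) -> {
      if (v == null) v = new Entry();
      v.users++;
      return v;
    });
    entry.lock.writeLock().lock(); // blocking call stays outside compute()
  }

  void releaseWrite(String name) {
    locks.compute(name, (k, v) -> {
      if (v == null) throw new IllegalStateException("not locked: " + k);
      v.lock.writeLock().unlock();
      return --v.users == 0 ? null : v; // returning null removes the entry
    });
  }

  int size() { return locks.size(); }

  /** Stress run: returns the final counter value and the number of leftover locks. */
  static int[] demo(int threads, int rounds) {
    LockRegistry registry = new LockRegistry();
    int[] counter = {0};
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int r = 0; r < rounds; r++) {
          registry.acquireWrite("db");
          try { counter[0]++; } finally { registry.releaseWrite("db"); }
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      try { t.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
    }
    return new int[] {counter[0], registry.size()};
  }

  public static void main(String[] args) {
    int[] result = demo(4, 1000);
    System.out.println(result[0] + " increments, " + result[1] + " locks left");
  }
}
```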

A new snapshot is available, too [2].

Thanks for the helpful feedback!
Christian

[1] 
https://github.com/BaseXdb/basex/commit/d3503d36325cb0fea58a13ee9f54feb5ce8868a6
[2] http://files.basex.org/releases/latest


if in the unsetLockIfUnused() method in the DBLocking
 class, I put the locks.remove(object) call in a synchronized(locks){} block,
 the problem does not appear any more, but as I don't understand exactly what
 the problem is, I am not sure if it really solves it or if it just changes
 the timing a little bit, making the problem less likely to happen.

 Regards

 Simon




 Hello Christian,

 After some testing on my side, I didn't see the
 ConcurrentModificationException any more.
 That's the good news.

 However, when running a slightly modified version of the small test case
 you wrote from my sample application, I faced another problem.
 The test can run for a few seconds to almost an hour but eventually, the
 following exception is thrown.

 java.lang.IllegalMonitorStateException
 at
 java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryRelease(ReentrantReadWriteLock.java:374)
 at
 java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
 at
 java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.unlock(ReentrantReadWriteLock.java:1131)
 at org.basex.core.locks.DBLocking.release(DBLocking.java:215)
 at org.basex.core.Context.unregister(Context.java:287)
 at org.basex.core.Command.execute(Command.java:105)
 at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
 at org.basex.api.client.Session.execute(Session.java:36)
 at basextest.BaseXTestDBLocking$1.run(BaseXTestDBLocking.java:43)

 I attach my new test case (BaseXTestAdd.java), but the main modification
 is that in each created thread I also add documents to the collections
 instead of only opening it.

 I was also able to see that in the call to getOrCreateLock() in
 DBLocking#release(final Proc pr) (line 212) the lock is created although it
 should already be in the locks map, but I really cannot understand how this
 is possible. It would mean that the lock was removed by another thread, but
 for that the usage value must be wrong in the lockUsage map, and I cannot
 find any sequence of operations that could lead to such a situation.

 Trying to pinpoint the problem more precisely, I wrote another test
 (BaseXTestDBLocking.java) that calls the acquire and release methods
 of the DBLocking class directly. The problem seems to happen more quickly
 with this test.


 Any thoughts ?

 Regards

 Simon




Re: [basex-talk] Distributing queries to several on several processors

2015-04-22 Thread Andy Bunce
Hi Erol,

I am not volunteering :-) but if somebody wants to take this route, this
code might give some pointers [1].
It uses Apache Spark to run Saxon-HE; there is an XQuery example [2] and
more info [3].

/Andy

[1] https://github.com/elsevierlabs/spark-xml-utils
[2] https://github.com/elsevierlabs/spark-xml-utils/wiki/xquery
[3]
http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1407936616.34624.yahoomail...@web141003.mail.bf1.yahoo.com%3E

On 22 April 2015 at 10:05, Erol Akarsu eaka...@gmail.com wrote:

 Christian,

 I think we should be able to attach BaseX to Apache Spark, but the
 integration code needs to be written.
 Everybody is able to read from Hadoop, SOLR, ElasticSearch etc. into Spark
 and process the data there.
 Why not for BaseX?

 Erol Akarsu

 On Wed, Apr 22, 2015 at 4:28 AM, Christian Grün christian.gr...@gmail.com
  wrote:

 [...]





[basex-talk] multi-language full-text indexing

2015-04-22 Thread Goetz Heller
Here's another addendum: even if multi-language full-text indexing is not going
to be implemented in the near future, it would still be a useful feature to be
able to restrict full-text indexing to parts of a document, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/PART_A,
  (path_b)/PART_B, …
)

Kind regards,

Goetz

-Original Message-
From: Christian Grün [mailto:christian.gr...@gmail.com]
Sent: Wednesday, 22 April 2015 11:03
To: Goetz Heller
Cc: BaseX
Subject: Re: [basex-talk] multi-language full-text indexing

 It is desirable to have
 documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would take quite some
effort to implement. There are also various conceptual issues related to XQuery
Full Text: if you don't specify the language in the query, we'd need to
decide dynamically which stemmers to use for the query strings, depending on the
nodes that are currently targeted.
This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful,
depending on the particular use case, we usually recommend that users create
additional BaseX databases, which can then serve as indexes. This can all be
done in XQuery. I remember there have been various examples of this on this
mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] 
https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] 
https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html







 CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
   (path_a)/LOCALIZED_PART_A[@LANG=$lang],
   (path_b)/LOCALIZED_PART_B[@LG=$lang], …
 ) FOR LANGUAGE $lang IN (
   BG,
   DN,
   DE WITH STOPWORDS filepath_de WITH STEM = YES,
   EN WITH STOPWORDS filepath_en,
   FR, …
 ) [USING language_code_map]

 and then to write full-text retrieval queries with a clause such as 
 ‘FOR LANGUAGE BG’, for example. The index parts would be much smaller 
 and full-text retrieval therefore much faster. The language codes 
 would be mapped somehow to standard values recognized by BaseX in the 
 language_code_map file.

 Are there any efforts towards such a feature?



Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Christian Grün
Reminds me of an old GitHub issue.. I have added a link to your
request: https://github.com/BaseXdb/basex/issues/59.


On Wed, Apr 22, 2015 at 11:35 AM, Goetz Heller hel...@hellerim.de wrote:
 [...]



Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Goetz Heller
Thanks, Fabrice!
I’ll work it out.
 
Kind regards,
 
Goetz
 
 
From: Fabrice Etanchaud [mailto:fetanch...@questel.com]
Sent: Wednesday, 22 April 2015 11:32
To: Goetz Heller; basex-talk@mailman.uni-konstanz.de
Subject: RE: [basex-talk] multi-language full-text indexing
 
Great, Goetz!

One last thing:
If you need to rebuild the original document from its parts, be sure to have a
way to retrieve them all (by document path, attribute index, or a separate
index collection with node-id/pre values).

If disk space is not an issue, you could store the original document as it
is, and create localized collections for full-text indexing purposes.
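
A small sketch of the bookkeeping this advice implies, assuming each localized fragment is stored together with the path of its source document and its position in it. The LocalizedParts and Part names, and the in-memory maps standing in for collections, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LocalizedParts {
  /** One localized fragment plus the key needed to put it back in place. */
  static final class Part {
    final String docPath;
    final int position;
    final String lang;
    final String text;

    Part(String docPath, int position, String lang, String text) {
      this.docPath = docPath;
      this.position = position;
      this.lang = lang;
      this.text = text;
    }
  }

  /** Distributes {lang, text} fragments of one document over per-language collections. */
  static Map<String, List<Part>> split(String docPath, String[][] fragments) {
    Map<String, List<Part>> collections = new LinkedHashMap<>();
    for (int i = 0; i < fragments.length; i++) {
      Part part = new Part(docPath, i, fragments[i][0], fragments[i][1]);
      collections.computeIfAbsent(part.lang, k -> new ArrayList<>()).add(part);
    }
    return collections;
  }

  /** Gathers all parts of one document from every collection, in original order. */
  static List<String> rebuild(Map<String, List<Part>> collections, String docPath) {
    List<Part> parts = new ArrayList<>();
    for (List<Part> collection : collections.values()) {
      for (Part part : collection) {
        if (part.docPath.equals(docPath)) parts.add(part);
      }
    }
    parts.sort(Comparator.comparingInt((Part p) -> p.position));
    List<String> texts = new ArrayList<>();
    for (Part part : parts) texts.add(part.text);
    return texts;
  }

  public static void main(String[] args) {
    String[][] fragments = {{"de", "Hund"}, {"en", "dog"}, {"de", "Katze"}};
    Map<String, List<Part>> collections = split("doc1.xml", fragments);
    System.out.println(collections.keySet());
    System.out.println(rebuild(collections, "doc1.xml"));
  }
}
```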
 
Hoping it helps,
 
Best regards,
Fabrice
 
From: basex-talk-boun...@mailman.uni-konstanz.de
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz
Heller
Sent: Wednesday, 22 April 2015 11:20
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] multi-language full-text indexing
 
Fabrice,
For the time being, this sounds quite nice. I’d have to split up the files into
a common part and a set of “satellites”, one satellite for each language
present in the document.
 
Thanks!
 
Kind regards,
 
Goetz
 
From: Fabrice Etanchaud [mailto:fetanch...@questel.com]
Sent: Wednesday, 22 April 2015 11:04
To: Goetz Heller; basex-talk@mailman.uni-konstanz.de
Subject: RE: [basex-talk] multi-language full-text indexing
 
Dear Goetz,
 
I have the same requirement (patent documents containing text in different
languages).
I ended up splitting/filtering each original document into localized parts,
inserted into different collections (each collection having its own full-text
index configuration).
BaseX is as flexible as our data!
 
Best regards,
 
 
From: basex-talk-boun...@mailman.uni-konstanz.de
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz
Heller
Sent: Wednesday, 22 April 2015 10:50
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] multi-language full-text indexing
 
I’m working with documents destined to be consumed anywhere in the European
Community. Many of them have the same tags multiple times but with a
different language attribute. It therefore does not make sense to create a
full-text index for the whole of these documents. It is desirable to have
documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG,
  DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as ‘FOR
LANGUAGE BG’, for example. The index parts would be much smaller and
full-text retrieval therefore much faster. The language codes would be
mapped somehow to standard values recognized by BaseX in the
language_code_map file.

Are there any efforts towards such a feature?


Re: [basex-talk] Creation of Full-Text-Index failed

2015-04-22 Thread Christian Grün
...enjoy the fixed version [1].
Christian

[1] http://files.basex.org/releases/latest


On Tue, Apr 21, 2015 at 8:56 PM, Goetz Heller hel...@hellerim.de wrote:
 For the task at hand I need to create a database on a daily basis from file
 packages I receive. The language used here is German; however, the files
 contain lots of international characters as well. Usually this does no
 harm, and I don’t know if this is the real cause of the failure in this case.

 The database was in fact created, but an error message occurred which was
 not very specific: “file xxx could not be parsed”. File “xxx” was the last
 file of the package, and it was accessible for XQuery search. However, no
 full-text index was created as with the other packages. Trying to create the
 index directly resulted in a different message: “Improper use? … Stack
 Trace: java.lang.ArrayIndexOutOfBoundsException”.



 The package can be downloaded from
 http://www.hellerim.de/downloads/BaseX/20150203_023.7z.



 This does not look like a problem with the data but rather like a bug in
 BaseX. If I’m wrong, however, I would prefer to get a message which points
 me to the problem so I can try to solve it.



 Kind regards,



 Goetz








[basex-talk] multi-language full-text indexing

2015-04-22 Thread Goetz Heller
I'm working with documents destined to be consumed anywhere in the European
Community. Many of them have the same tags multiple times but with a
different language attribute. It therefore does not make sense to create a
full-text index for the whole of these documents. It is desirable to have
documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG,
  DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR
LANGUAGE BG', for example. The index parts would be much smaller and
full-text retrieval therefore much faster. The language codes would be
mapped somehow to standard values recognized by BaseX in the
language_code_map file.

Are there any efforts towards such a feature?
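
To make the idea of smaller per-language index parts concrete, here is a toy sketch of one inverted index per language code, each with its own stopword list. It does not reflect BaseX syntax or internals; LanguageIndexes and its methods are hypothetical, and stemming is left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LanguageIndexes {
  // language code -> (token -> ids of documents containing the token)
  private final Map<String, Map<String, Set<String>>> byLang = new HashMap<>();
  private final Map<String, Set<String>> stopwords = new HashMap<>();

  void setStopwords(String lang, String... words) {
    stopwords.put(lang, new HashSet<>(Arrays.asList(words)));
  }

  /** Tokenizes the text and adds it to the index of the given language only. */
  void add(String lang, String docId, String text) {
    Map<String, Set<String>> index = byLang.computeIfAbsent(lang, k -> new HashMap<>());
    Set<String> stop = stopwords.getOrDefault(lang, Collections.<String>emptySet());
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (token.isEmpty() || stop.contains(token)) continue;
      index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
    }
  }

  /** Looks a term up in one language's index, the analogue of 'FOR LANGUAGE xx'. */
  Set<String> search(String lang, String term) {
    Map<String, Set<String>> index = byLang.get(lang);
    if (index == null) return Collections.emptySet();
    Set<String> hits = index.get(term.toLowerCase(Locale.ROOT));
    return hits == null ? Collections.<String>emptySet() : hits;
  }

  public static void main(String[] args) {
    LanguageIndexes indexes = new LanguageIndexes();
    indexes.setStopwords("de", "der", "die", "das");
    indexes.add("de", "doc1", "Der Hund und die Katze");
    indexes.add("en", "doc1", "The dog and the cat");
    System.out.println(indexes.search("de", "hund"));
    System.out.println(indexes.search("de", "dog"));
  }
}
```

Each lookup only touches the index of the requested language, which is where the expected speed-up would come from.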


Re: [basex-talk] IllegalMonitorStateException at org.basex.core.locks.DBLocking

2015-04-22 Thread Simon Chatelain
Hello,

Excellent. Glad to be of use.

I'll try the new snapshot right away.

Cheers

Simon

On Wed, Apr 22, 2015 at 10:06 AM, Christian Grün christian.gr...@gmail.com
wrote:

 Hi Simon,

 I finally had time to look at your examples, and...

  One more detail: [...]

 ...seemed to fix it! The original version of this class was written by
 Jens (in the cc), but I also believe that the basic problem was that
 the locks instance was not synchronized. In my fix, I used a
 ConcurrentHashMap instance and changed other minor things in the code
 (see [1], 8.2 branch).

 A new snapshot is available, too..

 Thanks for the helpful feedback!
 Christian

 [1]
 https://github.com/BaseXdb/basex/commit/d3503d36325cb0fea58a13ee9f54feb5ce8868a6
 [2] http://files.basex.org/releases/latest


 if in the unsetLockIfUnused() method in the DBLocking
  class, I put the locks.remove(object) call in a synchronized(locks){}
 block,
  the problem does not appear any more, but as I don't understand exactly
 what
  the problem is, I am not sure if it really solves it or if it just change
  the timing a little bit making the problem less likely to happen.
 
  Regards
 
  Simon
 
 
 
 
  Hello Christian,
 
  After some testing on my side, I didn't see the
  ConcurrentModificationException any more.
  That's the good news.
 
  However, when running a slightly modified version of the small test case
  you wrote from my sample application, I faced another problem.
  The test can run for a few seconds to almost an hour but eventually, the
  following exception is thrown.
 
  java.lang.IllegalMonitorStateException
  at
 
 java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryRelease(ReentrantReadWriteLock.java:374)
  at
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
  at
 
 java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.unlock(ReentrantReadWriteLock.java:1131)
  at org.basex.core.locks.DBLocking.release(DBLocking.java:215)
  at org.basex.core.Context.unregister(Context.java:287)
  at org.basex.core.Command.execute(Command.java:105)
  at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
  at org.basex.api.client.Session.execute(Session.java:36)
  at basextest.BaseXTestDBLocking$1.run(BaseXTestDBLocking.java:43)
 
  I attach my new test case (BaseXTestAdd.java), but the main modification
  is that in each created thread I also add documents to the collections
  instead of only opening it.
 
  I was also able to see that in the call to getOrCreateLock() in
  DBLocking#release(final Proc pr) (line 212) the lock is created while it
  should already be in the locks map, but I really cannot understand how
  this is possible. It would mean that the lock was removed by another
  thread, but for that the usage value must be wrong in the lockUsage map,
  and I cannot find any sequence of operations that could lead to such a
  situation.
 
  Trying to pin-point more precisely the problem, I wrote another test
  (BaseXTestDBLocking.java) that calls directly the acquire and release
  methods of the DBLocking class. The problem seems to happen more quickly
  with this test.
 
 
  Any thoughts ?
 
  Regards
 
  Simon
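
Editorial sketch: the messages above describe a map of locks plus a separate usage counter, where a lock seems to vanish from the map while it is still held. One way to rule out that class of race is to make "fetch or create and count" and "uncount and remove if unused" single atomic steps per key. The following is a minimal sketch under that assumption. The class LockRegistry and its methods are hypothetical and are not the BaseX DBLocking code, and the thread does not establish that this is the actual cause of the exception.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical per-name lock registry; not the BaseX DBLocking class. */
final class LockRegistry {
  /** A lock plus the number of threads holding or waiting for it. */
  private static final class Entry {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    int users; // only touched inside compute(), which is atomic per key
  }

  private final ConcurrentHashMap<String, Entry> locks = new ConcurrentHashMap<>();

  /** Atomically fetches or creates the entry and increments its usage count. */
  private Entry pin(String name) {
    return locks.compute(name, (key, entry) -> {
      if (entry == null) entry = new Entry();
      entry.users++;
      return entry;
    });
  }

  /** Atomically decrements the usage count and drops the entry once unused. */
  private void unpin(String name) {
    locks.compute(name, (key, entry) -> {
      if (entry == null) throw new IllegalStateException("unknown lock: " + key);
      return --entry.users == 0 ? null : entry;
    });
  }

  void acquireWrite(String name) {
    // Count first, then block: a waiting thread keeps the entry alive.
    pin(name).lock.writeLock().lock();
  }

  void releaseWrite(String name) {
    // The caller still counts as a user, so the entry must be present.
    Entry entry = locks.get(name);
    if (entry == null) throw new IllegalStateException("unknown lock: " + name);
    entry.lock.writeLock().unlock();
    unpin(name);
  }

  int size() {
    return locks.size();
  }

  public static void main(String[] args) throws InterruptedException {
    LockRegistry registry = new LockRegistry();
    int[] counter = {0}; // deliberately unsynchronized; guarded by the write lock
    Thread[] workers = new Thread[4];
    for (int i = 0; i < workers.length; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < 10_000; j++) {
          registry.acquireWrite("db");
          try { counter[0]++; } finally { registry.releaseWrite("db"); }
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) t.join();
    System.out.println(counter[0] + " updates, " + registry.size() + " locks left");
  }
}
```

Because the count and the map entry change together inside compute(), a thread that is about to block on a lock can never see that lock removed underneath it, which is the situation the release() call above appears to run into.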
 
 



Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Fabrice Etanchaud
Dear Goetz,

I have the same requirement (patent documents containing text in different 
languages).
I ended up splitting/filtering each original document into localized parts, 
inserted into different collections (each collection having its own full-text 
index configuration).
BaseX is as flexible as our data!

Best regards,
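
Editorial sketch: the splitting step described above could look roughly like the following, using only the JDK's DOM and XPath APIs. It assumes that localized parts are elements carrying a LANG attribute (as in the original question) and that such elements are not nested inside each other. The class name LanguageSplitter and the wrapper element name "parts" are made up for illustration. Loading the results into BaseX is not shown.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: splits one document into per-language documents. */
final class LanguageSplitter {
  /** Returns one document per language, keyed by the LANG attribute value. */
  static Map<String, Document> split(String xml) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document source = builder.parse(new InputSource(new StringReader(xml)));

      // Select every element that declares a language (assumed attribute name: LANG).
      NodeList parts = (NodeList) XPathFactory.newInstance().newXPath()
          .evaluate("//*[@LANG]", source, XPathConstants.NODESET);

      Map<String, Document> byLang = new TreeMap<>();
      for (int i = 0; i < parts.getLength(); i++) {
        Element part = (Element) parts.item(i);
        String lang = part.getAttribute("LANG");
        Document target = byLang.get(lang);
        if (target == null) {
          // First part for this language: start a new document with a wrapper root.
          target = builder.newDocument();
          target.appendChild(target.createElement("parts"));
          byLang.put(lang, target);
        }
        // Deep-copy the localized part into the document for its language.
        target.getDocumentElement().appendChild(target.importNode(part, true));
      }
      return byLang;
    } catch (Exception e) {
      throw new IllegalArgumentException("cannot split document", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<doc><title LANG=\"DE\">Hallo</title><title LANG=\"EN\">Hello</title></doc>";
    split(xml).forEach((lang, doc) ->
        System.out.println(lang + ": " + doc.getDocumentElement().getTextContent()));
  }
}
```

Each resulting document would then go into its own collection, whose full-text settings (language, stemming, stop words) can be chosen to match.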


From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz Heller
Sent: Wednesday, 22 April 2015 10:50
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] multi-language full-text indexing

I'm working with documents destined to be consumed anywhere in the European 
Community. Many of them contain the same tags multiple times, each with a 
different language attribute. It therefore does not make sense to create a 
single full-text index over the whole of these documents. It is desirable to 
have documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
(path_a)/LOCALIZED_PART_A[@LANG=$lang],
(path_b)/LOCALIZED_PART_B[@LG=$lang],...
) FOR LANGUAGE $lang IN (
BG,
DN,
DE WITH STOPWORDS filepath_de WITH STEM = YES,
EN WITH STOPWORDS filepath_en,
FR, ...
)  [USING language_code_map]
and then to write full-text retrieval queries with a clause such as 'FOR 
LANGUAGE BG', for example. The index parts would be much smaller and full-text 
retrieval therefore much faster. The language codes would be mapped somehow to 
standard values recognized by BaseX in the language_code_map file.
Are there any efforts towards such a feature?


Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Christian Grün
 It is desirable to have
 documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite
some effort to realize it. There are also various conceptual issues
related to XQuery Full Text: If you don't specify the language in the
query, we'd need to dynamically decide what stemmers to use for the
query strings, depending on the nodes that are currently targeted.
This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be
helpful, depending on the particular use case, we usually recommend
users to create additional BaseX databases, which can then serve as
indexes. This can all be done in XQuery. I remember there have been
various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] 
https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] 
https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html









Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Christian Grün
In a nutshell: It would take some more time to explain all the
implications. We know that there are various non-trivial issues to be
solved, as we already thought about adding such an index some years
ago.

Cheers,
Christian


On Wed, Apr 22, 2015 at 11:15 AM, Goetz Heller hel...@hellerim.de wrote:
 The case you described should be made a non-issue:
 If a multi-language full-text index was created then it was surely intended 
 to execute searches within the confines of a specific language. Hence, if 
 none was specified in the query, a runtime error should be thrown in such 
 cases.

 Kind regards,

 Goetz




Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Fabrice Etanchaud
Great, Goetz !

One last thing:
If you need to rebuild the original document from parts, be sure to have a way 
to retrieve them all (by document path, attribute index, or separate index 
collection with node-id/pre values).

If disk space is not an issue, you could store the original document as it is, 
and create localized collections for full-text indexing purposes.

Hoping it helps,

Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz Heller
Sent: Wednesday, 22 April 2015 11:20
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] multi-language full-text indexing

Fabrice,
For the time being, this sounds quite nice. I'd split up the files into a 
common part and a set of satellites, one satellite for each language present 
in the document.

Thanks!

Kind regards,

Goetz



Re: [basex-talk] Distributing queries to several on several processors

2015-04-22 Thread Christian Grün
Any volunteers out there? ;)


On Wed, Apr 22, 2015 at 11:05 AM, Erol Akarsu eaka...@gmail.com wrote:
 Christian,

 I think we should be able to attach BaseX to Apache Spark, but the
 integration code needs to be written.
 Everybody is able to read from Hadoop, Solr, Elasticsearch etc. into Spark
 and process the data there.
 Why not for BaseX?

 Erol Akarsu

 On Wed, Apr 22, 2015 at 4:28 AM, Christian Grün christian.gr...@gmail.com
 wrote:

 Hi Götz,

  it would make perfect sense to parallelize the query. Is there a way to
  achieve this using XQuery?

 Our initial attempts to integrate low-level support for
 parallelization in XQuery turned out not to be as successful as we
 hoped they would be. One reason for that is that you can basically do
 everything with XQuery, and it's pretty hard to detect patterns in the
 code that are simple enough to be parallelized. Next to that, Java
 does not give us enough facilities to control CPU caching behavior.

 As you already indicated, you can simply run multiple queries in
 parallel by e.g. using Java threads or the BaseX client/server
 architecture (which by default allows 8 transactions in parallel [1]).
 If your queries do a lot of I/O, you will often get better performance
 by only allowing one transaction at a time, though. This is due to the
 random access patterns on your external drives (and in my experience,
 it also applies to SSDs). However, if you work with main-memory
 instances of databases, parallelization might give you some
 performance gains (albeit not as big as you might expect).

 Hope this helps,
 Christian

 [1] http://docs.basex.org/wiki/Options#PARALLEL
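
Editorial sketch: the "run multiple queries in parallel using Java threads" approach mentioned above can be illustrated with a plain thread pool. The code below is not tied to any BaseX API. Each job stands in for "open a session and run one query", and the names ParallelRunner and runAll are hypothetical. The parallel argument plays the role of the transaction limit discussed above, so passing 1 corresponds to the advice for I/O-heavy queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs independent jobs on a bounded number of threads. */
final class ParallelRunner {
  /** Runs all jobs with at most {@code parallel} threads; results keep job order. */
  static <T> List<T> runAll(List<Callable<T>> jobs, int parallel) {
    ExecutorService pool = Executors.newFixedThreadPool(parallel);
    try {
      List<T> results = new ArrayList<>();
      // invokeAll blocks until every job has finished and preserves input order.
      for (Future<T> future : pool.invokeAll(jobs)) results.add(future.get());
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for jobs", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("job failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<Callable<String>> jobs = new ArrayList<>();
    for (String db : new String[] {"de", "en", "fr"}) {
      // Placeholder for "run one query against database db".
      jobs.add(() -> "result from " + db);
    }
    // 8 mirrors the default limit mentioned above; use 1 for I/O-bound work.
    runAll(jobs, 8).forEach(System.out::println);
  }
}
```

Whether this is faster than running the jobs one after another depends on the workload, as the message above points out, so it is worth measuring with parallel set to 1 as a baseline.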




[basex-talk] multi-language full-text indexing

2015-04-22 Thread Goetz Heller
The case you described should be made a non-issue:
If a multi-language full-text index was created then it was surely intended to 
execute searches within the confines of a specific language. Hence, if none was 
specified in the query, a runtime error should be thrown in such cases.

Kind regards,

Goetz
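
Editorial sketch: the rule proposed above (a query against a multi-language index must name its language, otherwise it fails at runtime) is easy to state in code. The following is a minimal illustration, assuming one index per language code. LanguageIndexRouter and the index names are hypothetical and do not correspond to anything in BaseX.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical router from a query's language code to a per-language index. */
final class LanguageIndexRouter {
  private final Map<String, String> indexByLang = new HashMap<>();

  /** Registers the index that serves the given language code. */
  void register(String lang, String indexName) {
    indexByLang.put(lang.toUpperCase(Locale.ROOT), indexName);
  }

  /** Resolves the index for a query; a missing or unknown language is an error. */
  String resolve(String lang) {
    if (lang == null || lang.isEmpty()) {
      throw new IllegalArgumentException("query must specify a language");
    }
    String index = indexByLang.get(lang.toUpperCase(Locale.ROOT));
    if (index == null) {
      throw new IllegalArgumentException("no full-text index for language: " + lang);
    }
    return index;
  }

  public static void main(String[] args) {
    LanguageIndexRouter router = new LanguageIndexRouter();
    router.register("DE", "docs-de");
    router.register("EN", "docs-en");
    System.out.println(router.resolve("de"));
    try {
      router.resolve(null);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing fast keeps the choice of stemmer and stop words unambiguous, at the cost of rejecting queries that would be valid against a single-language index.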




Re: [basex-talk] Distributing queries to several on several processors

2015-04-22 Thread Christian Grün
Hi Götz (cc @ basex-talk),

  OK, I think I understand. However, I think there should be some
possibilities to allow the user to give hints. In my opinion,
FOR-loops would be first-class candidates to use parallel streams, in
particular in the use case I described in my previous posting:

 FOR $var IN (collection)
 PARALLEL RETURN (expression-list)

Makes sense, in general. XQuery pragmas could be a solution:

  (# basex: parallel #) { ... }

Higher-order functions could be another option, e.g. a function like
hof:parallel-map(...).

However, this has many implications for the architecture and performance
of BaseX, because we'd need to create a new context for each
parallelized query, which takes additional time. Take the following
query as an example:

  $x[. = 123]

The dot refers to the current context item. If we parallelize a
query, we'd have multiple current context items. The same
multiplication would apply to the stack frame and other runtime
variables, and duplicating these instances is in most cases more
expensive than doing the work in a single thread.

At least that's our experience so far. Once again, we are happy to see
people jump into our code and show us that it can be done better..

Christian