[basex-talk] BaseX Scala Client 0.5 released, this time with documentation

2012-10-18 Thread Manuel Bernhardt
Hi all,

I finally found (or rather, forcefully took) the time to document the
Scala client library. The new release mostly brings documentation and
moves to BaseX 7.3.

We have been using it for 5 months now and it works nicely.

One interesting feature, if you're dealing with queries that return
large quantities of documents (and use Scala...), is the
StreamingClientSession, which doesn't cache incoming results but
passes them on directly to be consumed by the client code.
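To illustrate the difference (an illustrative model only; the names and interfaces below are hypothetical, not the library's actual API): a caching session materialises the whole result set before the client sees it, while a streaming session hands each result over as it arrives.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative model of caching vs. streaming result handling.
// The real BaseX/Scala client API differs; names here are hypothetical.
public class Sessions {

    public interface ResultHandler { void handle(String item); }

    // Collects streamed items; used in the usage example below.
    public static class Collector implements ResultHandler {
        public final StringBuilder sb = new StringBuilder();
        public void handle(String item) { sb.append(item); }
    }

    // Caching: the full result set is materialised in memory before
    // client code sees the first item - memory grows with result size.
    public static List<String> cachingQuery(Iterator<String> wire) {
        List<String> all = new ArrayList<String>();
        while (wire.hasNext()) all.add(wire.next());
        return all;
    }

    // Streaming: each incoming item is passed on immediately, so memory
    // use stays constant no matter how many documents the query returns.
    public static int streamingQuery(Iterator<String> wire, ResultHandler handler) {
        int count = 0;
        while (wire.hasNext()) { handler.handle(wire.next()); count++; }
        return count;
    }
}
```

With millions of returned documents, the caching variant is what exhausts the heap; the streaming variant is the behaviour StreamingClientSession is meant to provide.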


Check it out here: https://github.com/delving/basex-scala-client

Cheers,

Manuel
___
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk


Re: [basex-talk] Performance of Add command

2012-07-09 Thread Manuel Bernhardt
Hi again,

inserting 3M records now seems to take a lot less time - I've been
running an insertion for the past 40 minutes and it's close to
finishing (2.8M records so far). I have the impression that it still
gets slower as the database grows, but much less so - and I couldn't
put a finger on any particular method call with YourKit (it started at
a whopping 15K documents/second and is now at 300 documents/second).

I'll leave the computer running and see tomorrow how much time it took
in total (and give details on which calls took how long), but in any
case this is a huge improvement over how it used to be, thanks a lot!
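A cheap way to narrow down this kind of decay without a profiler is to log the rate per batch instead of the overall average (a generic sketch, not a BaseX API):

```java
// Per-batch throughput: compute documents/second for each batch from its
// own duration, so a gradual slowdown shows up directly in the numbers.
public class BatchRate {
    public static double docsPerSecond(long docs, long elapsedNanos) {
        return docs / (elapsedNanos / 1_000_000_000.0);
    }
}
```

Printing this every 10K adds would show whether the drop from 15K to 300 documents/second is gradual or happens in steps (e.g. at flush boundaries).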

Manuel

On Mon, Jul 9, 2012 at 2:04 PM, Manuel Bernhardt
bernhardt.man...@gmail.com wrote:
 Hi Christian,

 thanks for the fix! I'll test it right away on a big import.

 We don't have that many namespaces in those documents, but the general
 idea is to keep them, so we won't be using the STRIPNS feature for the
 time being (though we might in the future, depending on the use case).

 Thanks,

 Manuel

 On Sat, Jul 7, 2012 at 4:45 PM, Christian Grün
 christian.gr...@gmail.com wrote:
 …the problem should now be fixed. I'd be glad if you could once more
 test the import you've been discussing in your report with the latest
 code base/snapshot.

 Thanks in advance,
 Christian
 ___

 On Sat, Jun 30, 2012 at 7:01 PM, Manuel Bernhardt
 bernhardt.man...@gmail.com wrote:
 Hi,

 I'm doing some testing before migrating one of our customers to a new
 version of our platform that uses BaseX to store documents.
 They have approx. 4M documents, and I'm running an import of a
 1M-document collection on my laptop.

 The way I'm inserting documents is by firing off one Add command per
 document, based on a stream of the document, at a different (unique)
 path for each document, and flushing every 10K Adds.

 Since most CPU usage (on one core, the others being untouched) is
 taken by the BaseX server, I fired up YourKit out of curiosity to see
 where the CPU time was spent. My machine is a 2x4-core MacBook Pro
 with 8GB of RAM and an SSD, so hardware-wise it should do pretty fine.

 YourKit shows that most of the time seems to be spent in the
 Namespaces.update method:

 Thread-12 [RUNNABLE] CPU time: 2h 7m 9s
 org.basex.data.Namespaces.update(NSNode, int, int, boolean, Set)
 org.basex.data.Namespaces.update(int, int, boolean, Set)
 org.basex.data.Data.insert(int, int, Data)
 org.basex.core.cmd.Add.run()
 org.basex.core.Command.run(Context, OutputStream)
 org.basex.core.Command.exec(Context, OutputStream)
 org.basex.core.Command.execute(Context, OutputStream)
 org.basex.core.Command.execute(Context)
 org.basex.server.ClientListener.execute(Command)
 org.basex.server.ClientListener.add()
 org.basex.server.ClientListener.run()


 I'm not really sure what that method does - it's a recursive function
 and seems to be triggered by Data.insert:

 // NSNodes have to be checked for pre value shifts after insert
 nspaces.update(ipre, dsize, true, newNodes);

 The whole set of records should have no more than 5 different
 namespaces in total, so I'm wondering whether there is perhaps some
 potential for optimization here? Note that I'm completely ignorant as
 to what the method does and what its exact purpose is.

 Thanks,

 Manuel

 PS: the import is now finished: Storing 1001712 records into BaseX
 took 9285008 ms
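For reference, the totals in the PS work out to roughly 108 documents per second over about 2 hours 35 minutes:

```java
// Overall import throughput and duration from the totals quoted above.
public class ImportStats {
    public static double docsPerSecond(long docs, long millis) {
        return docs * 1000.0 / millis;
    }
    public static String duration(long millis) {
        long s = millis / 1000;
        return (s / 3600) + "h" + ((s % 3600) / 60) + "m" + (s % 60) + "s";
    }
}
```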


Re: [basex-talk] Performance of Add command

2012-07-09 Thread Manuel Bernhardt
(Context, OutputStream)
org.basex.core.Command.exec(Context, OutputStream)
org.basex.core.Command.execute(Context, OutputStream)
org.basex.core.Command.execute(Context)
org.basex.server.ClientListener.execute(Command)
org.basex.server.ClientListener.add()
org.basex.server.ClientListener.run()


Thread-30 --- Frozen for at least 51s
org.basex.index.resource.Docs.insert(int, Data)
org.basex.index.resource.Resources.insert(int, Data)
org.basex.data.Data.insert(int, int, Data)
org.basex.core.cmd.Add.run()
org.basex.core.Command.run(Context, OutputStream)
org.basex.core.Command.exec(Context, OutputStream)
org.basex.core.Command.execute(Context, OutputStream)
org.basex.core.Command.execute(Context)
org.basex.server.ClientListener.execute(Command)
org.basex.server.ClientListener.add()
org.basex.server.ClientListener.run()



I'm not exactly sure which of the above are relevant, but I thought
I'd share them anyway. I'll try to get some better measurements
tomorrow.


Manuel




Re: [basex-talk] Performance of Add command

2012-07-05 Thread Manuel Bernhardt
Hi,

On Mon, Jul 2, 2012 at 10:42 AM, Christian Grün
christian.gr...@gmail.com wrote:
 Another note: if your initial database is empty, and if your documents
 to be added are stored on disk, the operation will be much faster if
 you specify this directory along with the create command.

I had considered looking at this, but in our situation the source is a
stream that gets converted on the fly and then sent to the server
(which runs on a different machine than the one doing the inserts). By
the way, is there a reason why inserting from a file is faster than
from a stream? I'd expect both to use the same insertion mechanism.

Thanks,

Manuel



 great, thanks! If there's anything I can do to help, let me know.
 Right now I think I'm going to abort the import because it probably
 will take somewhat longer.

 Manuel

 On Mon, Jul 2, 2012 at 3:11 AM, Christian Grün
 christian.gr...@gmail.com wrote:
 Hi Manuel,

 sorry for the delayed feedback, and thanks for pointing to the
 Namespaces.update() method, which in fact updates the hierarchical
 namespaces structures in a database (well, you guessed that already…).
 As we first need to do some more research on potential optimizations,
 I have created a new GitHub issue to keep track of this bottleneck
 [1].

 Thanks,
 Christian

 [1] https://github.com/BaseXdb/basex/issues/523


Re: [basex-talk] Performance of Add command

2012-07-02 Thread Manuel Bernhardt
Hi,

a little update on this: I started the import of 3M documents last
evening using this method, and after 9h it's not yet finished (at
2.29M documents atm). So this operation looks a lot like it is in
O(n^2) (the insertion of 1M records took somewhat above 2h).
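Whether the growth is really quadratic can be sanity-checked by estimating the exponent k in T(n) ≈ c·n^k from two measurements. The figures below are the rough mid-run numbers from this thread (1001712 records in 9285008 ms, 2.29M records after about 9h); with them, k comes out around 1.5, i.e. clearly superlinear but below 2.

```java
// Estimate the growth exponent k in T(n) = c * n^k from two
// (size, cumulative time) measurements.
public class ScalingCheck {
    public static double exponent(double n1, double t1, double n2, double t2) {
        return Math.log(t2 / t1) / Math.log(n2 / n1);
    }
}
```

E.g. exponent(1001712, 9285008, 2290000, 32400000) compares the two runs with times in milliseconds; the 9h midpoint is approximate, so treat the result as a rough check only.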

Manuel



Re: [basex-talk] Performance of Add command

2012-07-02 Thread Manuel Bernhardt
Hi,

great, thanks! If there's anything I can do to help, let me know.
Right now I think I'm going to abort the import because it probably
will take somewhat longer.

Manuel

On Mon, Jul 2, 2012 at 3:11 AM, Christian Grün
christian.gr...@gmail.com wrote:
 Hi Manuel,

 sorry for the delayed feedback, and thanks for pointing to the
 Namespaces.update() method, which in fact updates the hierarchical
 namespaces structures in a database (well, you guessed that already…).
 As we first need to do some more research on potential optimizations,
 I have created a new GitHub issue to keep track of this bottleneck
 [1].

 Thanks,
 Christian

 [1] https://github.com/BaseXdb/basex/issues/523


Re: [basex-talk] Disable or control query caching

2012-05-22 Thread Manuel Bernhardt
Hi Christian,

I just witnessed this again. There was one process running a
streaming query (though I think it would not have made a big
difference had it been a cached one) over 9 records, and we
uploaded a few small collections after starting that one.
Additionally, I issued a LIST command from another client.

What happened next is that:
- the process with the long query went on
- the uploads were blocked (in queue)
- my call on the console was blocked (in queue)
- once the long query was done, all other operations proceeded

So it looks as though there is some kind of read lock at the server
level...? Am I perhaps doing something wrong when starting the long
query - e.g. should it be started within some kind of transaction or
special context?

Thanks,

Manuel

On Tue, May 22, 2012 at 12:18 AM, Christian Grün
christian.gr...@gmail.com wrote:
 […] There is one thing I noticed, however, and that I had noticed
 earlier on as well when a big collection was being processed: any
 attempt to talk to the server seems not to work, i.e. even when I try
 to connect via the command-line basexadmin and run a command such as
 "list" or "open db foo", I do not get a reply. […]

 I'm not quite sure what the problem is. Some questions that come to mind:

 -- does the problem occur with a single client?
 -- does "no reply" mean that your client request is being blocked, or
 that the returned result is empty?
 -- can you access your database via the standalone interfaces?

 Just in case: feel free to send a small Java example that
 demonstrates the issue.
 Christian


Re: [basex-talk] Best way to insert large amounts of records

2012-05-21 Thread Manuel Bernhardt
Hi Christian,

 I'd say that your approach is close to an optimal solution, as the ADD
 command is pretty cheap, compared to e.g. REPLACE. If you believe that
 you could still run into some bottlenecks, you could have a look at,
 or provide us, with the output of Java's profiler (e.g.
 -Xrunhprof:cpu=samples),

OK, I will look into this if we get bitten by performance issues (the
larger collections do usually take a fair amount of time to insert, at
least when done concurrently).

 - is there a performance penalty in doing this kind of parsing concurrently?

 Concurrent operations will be managed by the central transaction
 manager. At the time of writing this, all write operations are
 performed one after another, but in near future, concurrent write
 operations to different databases will also be run in parallel.
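A minimal sketch of that planned behaviour is one write lock per database name, so that writers to the same database queue up while writers to different databases proceed independently. This is an illustration only, not BaseX's actual transaction manager:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// One lock per database name: acquiring the lock serialises writes to
// that database only. Illustrative only, not the real transaction manager.
public class DbLocks {
    private final ConcurrentMap<String, ReentrantLock> locks =
        new ConcurrentHashMap<String, ReentrantLock>();

    public ReentrantLock lockFor(String db) {
        ReentrantLock l = locks.get(db);
        if (l == null) {
            ReentrantLock fresh = new ReentrantLock();
            l = locks.putIfAbsent(db, fresh);
            if (l == null) l = fresh;
        }
        return l;
    }
}
```

Two clients writing to "col1" share one lock and run one after another; a client writing to "col2" gets a different lock and is not blocked by them.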


Excellent news. I noticed things were slowing down when we had
multiple collections being inserted at the same time, so this should
probably help.

 - are there any JVM parameters that would help speed this up? I

 In general, Java will be faster when run with -server, but this option
 may have been chosen anyway by your Java runtime. Regarding the
 maximum amount of memory, there shouldn't be any noteworthy
 differences when adding documents.

 Hope this helps,
 Christian


Thanks!

Manuel


Re: [basex-talk] Disable or control query caching

2012-05-21 Thread Manuel Bernhardt
Hi Christian,

 as you have already seen, all results are first cached by the client
 if they are requested via the iterative query protocol. In earlier
 versions of BaseX, results were returned in a purely iterative manner
 -- which was more convenient and flexible from a user's point of view,
 but led to numerous deadlocks if reading and writing queries were
 mixed.

 If you only need parts of the requested results, I would recommend to
 limit the number of results via XQuery, e.g. as follows:

  ( for $i in /record[@version >= 0]
    order by $i/system/index
    return $i )[position() = 1 to 1000]


I had considered this, but haven't used that approach - yet - mainly
because I wanted to try the streaming approach first. So far our
system only used MongoDB and we are used to working with cursors as
query results, so I'm trying to keep that somehow aligned if possible.

 Next, it is important to note that the order by clause can get very
 expensive, as all results have to be cached anyway before they can be
 returned. Our top-k functions will probably give you better results if
 it's possible in your use case to limit the number of results [1].

OK, thanks. If this becomes a problem, I'll consider using this. Is
the query time of 0.06ms otherwise the actual time the query takes to
run? If yes, then I'm not too worried about query performance :)
In general, the bottleneck in our system is not so much the querying
but rather the processing of the records - I started rewriting this
to run concurrently using Akka, but am now stuck with a classloader
deadlock (no pun intended). It will likely take quite some effort for
the processing to become faster than the query iteration.

 A popular alternative to client-side caching (well, you mentioned that
 already) is to overwrite the code of the query client, and directly
 process the returned results. Note, however, that you need to loop
 through all results, even if you only need parts of the results.

I implemented this and it looks like it works nicely (to be confirmed
soon - I started a run on a 600k-record collection).


Thanks for your time!

Manuel


 Hope this helps,
 Christian

 [1] http://docs.basex.org/wiki/Higher-Order_Functions_Module#hof:top-k-by


Re: [basex-talk] Disable or control query caching

2012-05-21 Thread Manuel Bernhardt
Hello again,

 I implemented this and it looks like it works nicely (to be confirmed
 soon  - I started a run on a 600k records collection).

This runs nicely, in that the machine doesn't run out of memory
anymore. There is one thing I noticed, however, and that I had noticed
earlier on as well when a big collection was being processed: any
attempt to talk to the server seems not to work, i.e. even when I try
to connect via the command-line basexadmin and run a command such as
"list" or "open db foo", I do not get a reply. I can see the commands
in the log though:

17:28:06.532[127.0.0.1:33112]   LOGIN admin OK
17:28:08.158[127.0.0.1:33112]   LIST
17:28:21.288[127.0.0.1:33114]   LOGIN admin OK
17:28:25.602[127.0.0.1:33114]   LIST
17:28:52.676[127.0.0.1:33116]   LOGIN admin OK

Could it be that the long session is blocking the output stream coming
from the server?

Thanks,

Manuel



[basex-talk] Best way to insert large amounts of records

2012-05-18 Thread Manuel Bernhardt
Hi,

we're using BaseX to store multiple collections of documents (we call
them records).

These records are produced programmatically, by parsing an incoming
stream on a server application and turning it into documents of the
kind

<record id="123" version="1">
...
</record>

So far I have taken the following approach:

- each collection of records is its own database in BaseX, for easier management

- on insertion:
  - set the session's autoflush to false
  - iterate over the records
  - add them via add(id, document)
  - every 10,000 records, flush
  - finally, flush once more
  - create the attributes index
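The insertion steps above can be sketched against a hypothetical session interface (the real BaseX client API differs; `Session` here is illustrative only):

```java
// Hypothetical minimal session interface, for illustration only.
interface Session {
    void add(String path, String document);
    void flush();
}

public class BulkLoader {
    // Add all documents, flushing every batchSize adds and once at the
    // end; returns the total number of flush calls issued.
    public static int load(Session s, String[] paths, String[] docs, int batchSize) {
        int flushes = 0;
        for (int i = 0; i < paths.length; i++) {
            s.add(paths[i], docs[i]);
            if ((i + 1) % batchSize == 0) { s.flush(); flushes++; }
        }
        s.flush(); // final flush picks up the remainder
        return flushes + 1;
    }
}
```

Turning autoflush off and flushing in large batches amortises the cost of writing metadata to disk over many adds, which is why the periodic flush matters here.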


So for example now we have:

Name    Resources  Size
col1    14141      19815190
col2    14750      16697081
col3    84450      253593687
col4    1012477    2107593252
col5    126058     186315175
col6    13767      14640701
col7    815991     730536864
col8    31189      39598405
col9    24733      91277637
col10   171906     202392553
...

and there'll be quite a bit more coming in.

This kind of bulk insertion can also happen concurrently (I've set up
an actor pool of five for the moment).


My questions are:

- is this the most performant approach, or would it make sense to e.g.
build one stream on the fly and somehow turn it into an InputStream to
be sent via add?
- is there a performance cost in adding with an ID? We don't really
need them, since we retrieve records via a query - and those resources
aren't really files on the file system
- is there a performance penalty in doing this kind of parsing concurrently?
- are there any JVM parameters that would help speed this up? I
haven't quite found out how to pass JVM parameters when starting
basexserver via the command line. It looks like BaseX gave itself an
-Xmx of 1866006528 bytes (but that machine has 8GB, so it could in
theory get more.)

Thanks!

Manuel


[basex-talk] Scala client library for BaseX

2012-05-18 Thread Manuel Bernhardt
Hi,

I'd like to announce the first release of a Scala client library for
BaseX, which simplifies the idiomatic usage of BaseX within Scala
applications.

It's likely going to evolve quite a bit over the next weeks since
we're in the process of learning how to best use BaseX.

The source is available here: https://github.com/delving/basex-scala-client


And I'll try to find some time to write some documentation this weekend.

Comments, feedback etc. are of course very welcome!



Cheers,

Manuel


[basex-talk] Preferred way to run BaseX as service on Debian

2012-05-10 Thread Manuel Bernhardt
Hi,

is there perhaps an init.d script somewhere already in order to launch
basexserver as a service on Debian?

So far it looks as though there isn't one in the Debian package, so
I'm thinking of adding a line to rc.local to run it on startup.

Also, from what I gathered, basex is now only available in sid, is
that correct? I installed it on squeeze by downloading the .deb;
there was just one dependency, java-wrappers, that I needed to install
by hand.


Thanks,

Manuel


Re: [basex-talk] Preferred way to run BaseX as service on Debian

2012-05-10 Thread Manuel Bernhardt
Hi,

Thanks for the fast answers!

Yes, that init.d script looks like what I was looking for. If I
understand correctly, the data is going to be stored in the
BaseXData directory of the user who launched the service?

Another thing that came to mind: is there perhaps a way to bind
basexserver only to the loopback address (or to some configured IP
address)? I remember this being possible in e.g. MySQL, and it is
quite a nice way of securing the server (since someone could arguably
try to brute-force access knowing the port on the machine, or using
the default port). Of course, an alternative is to set up a rule in
iptables.

Also, I haven't quite found where the username and password are
configured - do I need to create a configuration file for this?

Some additional ideas for the Debian package:
- have it set up a basex user, which is the default user running basexserver
- set the data directory for that server to e.g. /var/lib/basex, to be
more in line with Debian's default behavior

Thanks,

Manuel

On Thu, May 10, 2012 at 4:06 PM, Christian Grün
christian.gr...@gmail.com wrote:
 Hi Manuel,

 thanks for your input; at times, there were some online references to
 init.d scripts for BaseX; maybe they could be of interest here?

  http://blog.neolocus.com/2012/02/basex-xml-server-as-a-linux-service/
  http://cubeb.blogspot.com/2011/07/basex_23.html

 Christian
 ___

 On Thu, May 10, 2012 at 3:47 PM, Alexander Holupirek a...@holupirek.de 
 wrote:
 Hi Manuel,

 On 10.05.2012, at 15:22, Manuel Bernhardt wrote:

 is there perhaps an init.d script somewhere already in order to launch
 basexserver as a service on Debian?

 no, not yet, but good idea. I filed an issue for that [1]

 So far it looks as though there isn't one in the Debian package, so
 I'm thinking of adding a line to rc.local to run it on startup.

 +1

 Also, from what I gathered, basex is now only available in sid, is
 that correct? I installed it on squeeze by downloading the deb,
 there's just one dependency on java-wrappers that I needed to install
 by hand.

 the current version is available in sid and, since yesterday, in testing.
 Right, java-wrappers is the only dependency. libtagsoup-java might be
 of interest if you want to process non-wellformed HTML.

 Providing the latest version as a squeeze-backport is a good idea as
 well (filed another issue [2]).

 Thanks,
        Alex

 [1] https://github.com/BaseXdb/basex/issues/499
 [2] https://github.com/BaseXdb/basex/issues/500