Re: [VOTE] Release Apache Jackrabbit 2.0.3
+1
Re: [VOTE] Release Apache Jackrabbit 2.1.2
+1 Regards, Thomas
Re: [jr3] Clustering: Scalable Writes / Asynchronous Change Merging
Hi, "Network delay .. is faster than the delay of a disk" I wrote that the network is the new disk (in terms of bottleneck, in terms of performance problems). Network delay may be a bit shorter now than disk access. But it's *still* a huge bottleneck (compared to in-memory operations), especially if cluster nodes are far apart. If you have the complete repository in memory, but each operation needs network access, then the network is the bottleneck. And I don't want that to be the bottleneck. I know I repeat myself, but it looks like this was not clear. "importance of leveraging in-memory storage" In-memory storage *alone* is fast. But if it is used in combination with the current clustering architecture, then writes will not scale. They will just be a bit faster (until you reach the network delay wall). "What is then the reason for asynchronous change merging, if not for performance?" Where did I say it's not about performance? As I already wrote: it's about how to manage cluster nodes that are relatively far apart. "observation listeners would not always get notified in the same order" Regular observation listeners are not necessarily the problem; we could just delay them until the pre-defined sync delay (until things are in sync). The problem is *synchronous* event listeners (as I already wrote). The JCR API doesn't actually define them, as far as I know. Regards, Thomas
Re: [jr3] Clustering: Scalable Writes / Asynchronous Change Merging
Hi, "See section 7 Vector Time. Also see [1] from slide 14 onwards for a more approachable reference. [1] http://www.cambridge.org/resources/0521876346/6334_Chapter3.pdf" Thanks! From what I have read so far, it sounds like my idea is called Time Warp / Virtual Time. On pages 10 and 11 there is the notion of Total Ordering: "The main problem in totally ordering events is that two or more events at different processes may have identical timestamp. - A tie-breaking mechanism is needed to order such events. - Process identifiers are linearly ordered and tie among events with identical scalar timestamp is broken on the basis of their process identifiers." This is what I meant with "+ clusterNodeId". Vector time and Matrix time: I think they would need too much memory. If the dimension of vector clocks is the number of cluster nodes, then the number of dimensions would change whenever you add a cluster node. Time Warp matches my suggestion: page 33 says "Time Warp relies on the general lookahead-rollback mechanism where each process executes without regard to other processes having synchronization conflicts." It sounds like my proposal (I especially like the term Time Warp). "If a conflict is discovered, the offending processes are rolled back to the time just before the conflict and executed forward along the revised path. Virtual time is implemented [as] a collection of several loosely synchronized local virtual clocks." Regards, Thomas
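The tie-breaking rule quoted above (order by scalar timestamp, break ties with the process / cluster node id) can be sketched as a comparator. This is an illustrative sketch, not actual Jackrabbit code; the names are made up:

```java
import java.util.Comparator;

// Sketch: totally ordering change sets by scalar timestamp, breaking
// ties with the cluster node id ("timestamp + clusterNodeId").
final class ChangeSetId {
    final long timestamp;    // e.g. time since 1970
    final int clusterNodeId; // unique per cluster node

    ChangeSetId(long timestamp, int clusterNodeId) {
        this.timestamp = timestamp;
        this.clusterNodeId = clusterNodeId;
    }

    // Total order: timestamp first, cluster node id as tie-breaker.
    static final Comparator<ChangeSetId> TOTAL_ORDER =
        Comparator.comparingLong((ChangeSetId c) -> c.timestamp)
                  .thenComparingInt(c -> c.clusterNodeId);
}
```

With this order, two events with the same timestamp on different cluster nodes still compare deterministically, which is all the tie-break is needed for.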
Re: [jr3] Clustering: Scalable Writes / Asynchronous Change Merging
Hi, Let's discuss partitioning / sharding in another thread. Asynchronous change merging is not about how to manage huge repositories (for that you need partitioning / sharding); it's about how to manage cluster nodes that are relatively far apart. I'm not sure if this is the default use case for Jackrabbit. Traditionally, asynchronous change merging (synchronizing) is only used if the subsystems are offline for some time, or if there is a noticeable networking delay between them, for example if cluster nodes are in different countries. But I don't want the network to become the new disk (in terms of bottleneck, in terms of performance problems). Networking delay may be the bottleneck even if cluster nodes are in the same room, especially when you keep the whole repository in memory, or use SSDs. Also, computers get more and more cores, and at some point message passing is more efficient than locking. Asynchronous operation is bad for reservation systems, banking applications, or if you can't guarantee sticky sessions. There you need synchronous operations, or at least locking. If you want to support both cases in the same repository, you could use virtual repositories (which are also good for partitioning / sharding). My proposal is for Jackrabbit 3 only. In the extreme case, the asynchronous change merger might very well be a separate thread and use little more than the JCR API. Therefore asynchronous change merging should have very little or no effect on performance if it is not used. On the other hand, replication should likely be in the persistence layer. I think the persistence API should stay synchronous, as it is now. "We could also use normal UUIDs or SHA1 hashes of the serialized change sets" That's an option, but lookup by node id and time must be efficient. UUIDs / secure hashes are not that space efficient (though that might not be the problem).
We can see from Jackrabbit that indexing random data (UUIDs) is extremely bad for cache locality and index efficiency, but if indexing is done by time then that's also not a problem. The algorithm I propose is sensitive to configuration changes, but you only need to change the formula when going from max 256 cluster nodes to more than 256 cluster nodes (for example). And you need a unique cluster id. But I don't think that's the problem. "we could leverage a virtual time algorithm" I read the paper, but I don't actually understand how to implement it. "We'll probably need some mechanism for making the content of conflicting changes available for clients to review even if the merge algorithm chooses to discard them." If we leave it up to the client to decide what to do, then things might more easily run out of sync. But in any case there might be problems, for example synchronous event listeners might get a different order of events in different cluster nodes (possibly even different events). It would probably make sense to add some kind of offline comparison / sync feature, similar to rsync. Actually that could be useful even for Jackrabbit 2. Regards, Thomas
[jr3] Clustering: Scalable Writes / Asynchronous Change Merging
The current Jackrabbit clustering doesn't scale well for writes because all cluster nodes use the same persistent storage. Even if the persistent storage is clustered, the cluster journal relies on changes being immediately visible in all nodes. That means Jackrabbit clustering can scale well for reads, but it can't scale well for writes. This is a property Jackrabbit clustering shares with most clustering solutions for relational databases. Still, it would make sense to solve this problem for Jackrabbit 3.

== Current Jackrabbit Clustering ==

[Cluster Node 1] --+
                   +-- [Shared Storage]
[Cluster Node 2] --+

I propose a different architecture for Jackrabbit 3:

== Jackrabbit 3 Clustering ==

[Cluster Node 1] -- [Local Storage]
[Cluster Node 2] -- [Local Storage]

Please note that shared storage is still supported for things like the data store, but it is no longer required (or even supported) for the persistent storage (currently called PersistenceManager). Instead, the cluster nodes should merge each other's changes asynchronously (except for operations like JCR locking, plus potentially other operations that are not that common; maybe even node move). By asynchronously I mean usually within a second or so, but potentially minutes later, depending on configuration, latency between cluster nodes, and possibly load. Similar to NoSQL systems.

== Unique Change Set Ids ==

For my idea to work, we need globally unique change set ids. Each change set is stored in the event journal, and can be retrieved later and sent to other cluster nodes. I suggest that events are grouped into change sets so that all events within the same session.save() operation have the same change set id. We could also call it a transaction id (I don't mind). Change set ids need to be unique across all cluster nodes.
That means the change set id could be:

changeSetId = nanosecondsSince1970 * totalClusterNodes + clusterNodeId

Let's say you currently have 2 cluster nodes and expect to add a few more later (up to 10); then you could use the formula:

changeSetId = nanosecondsSince1970 * 10 + clusterNodeId

To support more than 10 cluster nodes the formula would need to be changed (that could be done at runtime). It doesn't necessarily need to be this formula, but the change set id should represent the time when the change occurred, and it should be unique.

== How to Merge Changes ==

Changes need to be merged so that all cluster nodes end up with the same data (you could call this eventually consistent). New changes are not problematic and can be applied directly. This includes local changes of course, because the change set id of local changes is always newer than the last change. Changes with change set ids in the future are delayed. Cluster nodes should have reasonably synchronized clocks (they don't need to be completely exact, but they should be reasonably accurate, so that such delayed events are not that common). So the only tricky case is changes that happened in the past, in another cluster node, if the same data was changed in this cluster node (or another cluster node) afterwards (afterwards means with a higher change set id). To find out that a change happened in the past, each node needs to at least know the change set id of the last change. There are multiple solutions:

== Solution A: Node Granularity, Ignore Old Changes ==

Here, each node only needs to know when it was changed the last time. If the change set id is older than that, changes to its properties and child node list are ignored. That means, if two cluster nodes concurrently change data in a node, the newer change wins, and the older change is lost. This is a bit problematic, for example when concurrently adding child nodes: only the added child node of the newer change survives, which is probably unexpected.
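The change set id formula from the "Unique Change Set Ids" section above (changeSetId = time * totalClusterNodes + clusterNodeId) can be sketched as follows. Names are illustrative, not actual jr3 code; note that nanoseconds since 1970 multiplied by even a small cluster size overflows a signed 64-bit long, so this sketch uses milliseconds:

```java
// Sketch of the proposed change set id scheme (illustrative only).
// changeSetId = timeSince1970 * maxClusterNodes + clusterNodeId:
// unique across cluster nodes (distinct remainders modulo
// maxClusterNodes) and roughly ordered by wall-clock time.
final class ChangeSetIdFactory {
    private final long maxClusterNodes; // e.g. 10
    private final long clusterNodeId;   // 0 .. maxClusterNodes - 1
    private long lastId;

    ChangeSetIdFactory(long maxClusterNodes, long clusterNodeId) {
        this.maxClusterNodes = maxClusterNodes;
        this.clusterNodeId = clusterNodeId;
    }

    synchronized long newChangeSetId() {
        long id = System.currentTimeMillis() * maxClusterNodes + clusterNodeId;
        if (id <= lastId) {
            // clock did not advance: keep ids strictly increasing locally,
            // preserving the remainder (the cluster node id)
            id = lastId + maxClusterNodes;
        }
        lastId = id;
        return id;
    }

    // the cluster node that generated a given change set id
    static long clusterNodeOf(long changeSetId, long maxClusterNodes) {
        return changeSetId % maxClusterNodes;
    }
}
```

Changing the formula to support more cluster nodes (as described above) would mean constructing a new factory with a larger maxClusterNodes.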
== Solution B: Merge Old Changes ==

Here, we need an efficient way to load the list of changes (events) to a node since a certain time. When merging a change, the old versions of the node need to be loaded or re-constructed, and then the old change needs to be applied as if it had already happened before the newer change. Let's say we know about the two versions:

v1: node a; child nodes b, c, d; properties x=1, y=2
event t9: add child node e, set property x=2, remove property y
v9: node a; child nodes b, c, d, e; properties x=2

The change to merge happened in the past:

event t3: add child node f, remove child node b, set property y=3, remove property x, set property z=1

Now the result would be:

v9(new): node a; child nodes c, d, e, f; properties x=2, z=1

There are other ways to merge the changes of course (for example, only merge added / removed child nodes). I think there are some tricky problems, but I think it's relatively easy to ensure the algorithm is correct using a few randomized test cases. No matter what the merge rules are, they would need to be constructed so that at the end of the day, each cluster node ends up with the exact same data.
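As a sanity check, the example above can be replayed mechanically: starting from v1 and applying the change sets in change set id order (t3 before t9) yields exactly v9(new). A minimal illustration of that replay, not actual jr3 code:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: replaying the change sets from the example in
// change set id order (t3 before t9), starting from v1, yields v9(new).
final class NodeState {
    final TreeSet<String> childNodes = new TreeSet<>();
    final Map<String, Integer> properties = new TreeMap<>();
}

final class MergeExample {
    static NodeState replay() {
        // v1: node a; child nodes b, c, d; properties x=1, y=2
        NodeState a = new NodeState();
        a.childNodes.add("b"); a.childNodes.add("c"); a.childNodes.add("d");
        a.properties.put("x", 1); a.properties.put("y", 2);

        // t3 (the older change, merged late): add child node f,
        // remove child node b, set y=3, remove x, set z=1
        a.childNodes.add("f"); a.childNodes.remove("b");
        a.properties.put("y", 3); a.properties.remove("x");
        a.properties.put("z", 1);

        // t9 (the newer change): add child node e, set x=2, remove y
        a.childNodes.add("e");
        a.properties.put("x", 2); a.properties.remove("y");

        // result, v9(new): child nodes c, d, e, f; properties x=2, z=1
        return a;
    }
}
```

The real implementation would of course apply the same idea generically (re-construct the state before the older change, then replay in id order) rather than hard-coding the events.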
Re: Concurrent Write issues with Jackrabbit
Hi, Are you sure the problem is concurrency and not performance? Are you sure that the persistence manager you use does support higher write throughput? What persistence manager do you use, and what is the write throughput you see, and what do you need? Regards, Thomas
Re: Concurrent Write issues with Jackrabbit
Hi, Do you use Day CRX / CQ? If yes, I suggest to use the Day support. Regards, Thomas
Re: Concurrent Write issues with Jackrabbit
Hi, "What I am getting here is that writes will be serialized due to a single write lock" For scalability, you also need scalable hardware. Just using multiple threads will not improve performance if all the data is then stored on the same disk. Regards, Thomas
Re: Diagram of Jackrabbit remoting options
Hi, It looks good to me too (but I'm not an expert in this area). "i don't fully understand the distinction between 'Component' and 'Shared Code', but apart from that, looks good to me." I also don't understand the difference. Regards, Thomas
Re: bit by JCR-1558 - what are valid workarounds?
Hi, I am not completely sure if it solves the problem, but could you try sharing the repository-level FileSystem between cluster nodes? That means, configure the first FileSystem entry in the repository.xml (the one directly within the Repository element, not the one within the Workspace element) in each cluster node so that it points to the same file system. I think the clustering documentation at http://wiki.apache.org/jackrabbit/Clustering is incorrect here. But I didn't try it myself, so I'm not completely sure it will work. If it doesn't work, please post the Jackrabbit version you use, the configuration, and the file listing (of both the shared and the local files). If it does work, please tell us as well (I will then update the documentation). Regards, Thomas
Re: [jr3] Security through obscurity
"well, i don't ;) i don't think that a proper oo design will necessarily be overly complex." Having everything convoluted just for the sake of avoiding public implementation methods is completely unrelated to proper OO design. It may be your understanding of proper OO design, but it's definitely not mine. Anyway, let's see how it goes. Could those who suggest getting rid of the public implementation methods please submit a patch for the Jackrabbit 3 prototype? We can discuss from there. Regards, Thomas
Re: [jr3] Security through obscurity
Hi, I'm sorry about the tone of my mails. I just want to avoid that we run into the trap of making Jackrabbit 3 much too complicated and complex for the sake of being modular. I agree there shouldn't be many public implementation methods, but what I don't want to do is add additional glue classes to avoid them. That would be adding complexity to conceal bad design, and then calling it good design. I would rather have some public methods, if the overall design is simpler, than that added indirection. This is not just about public methods. It's also about splitting Jackrabbit 3 into multiple projects. In my view, we should keep it one project, and one jar file, at least for now. I believe there are far too many projects and jar files in Jackrabbit 2. Regards, Thomas
[jr3] Security through obscurity
Do we want to have public methods in the Jackrabbit 3 implementation that can possibly be misused (if somebody casts to an implementation class)? See the discussion at https://issues.apache.org/jira/browse/JCR-2640 The advantage of not having such public methods: people can't cast to implementation classes and call them. Is this really a problem? People should use the JCR API; they are not supposed to cast to implementation classes. The disadvantages are: it massively complicates developing Jackrabbit 3. It complicates understanding the source code. It potentially reduces performance. It needs more memory (potentially a lot, for example for cached objects such as NodeImpl). It's probably not always possible to follow this rule. And it doesn't solve the problem (people can still modify the source code of Jackrabbit, or they can call setAccessible(true)). Wikipedia currently defines security through obscurity as follows: "a principle in security engineering, which attempts to use secrecy (of design, implementation, etc.) to provide security." In my view this is such a case. Examples of embedded repositories or databases a) that need more than one package and use this no-public-methods approach: I don't know any. Are there any such projects? b) that do have public implementation methods: all open source Java databases I know (Apache Derby, HSQLDB, H2); Hibernate (well, probably most projects); and I'm sure there are many cases in the Sun JDK and JRE, for example the xml packages, javac, javadoc, almost everywhere where interfaces and implementation are distinct and multiple packages are used.
Re: [jr3] Security through obscurity
"Not exposing implementation details through public API is a basic OO design principle. i think with a proper design and packaging, this will not be a problem." I don't think we are talking about the same thing here. Proper OO is using interfaces, and not casting to implementation classes. For example, constructors: those need to be public if you want to construct a new object from a different package. How do you create an org.apache.jackrabbit.j3.nodetype.NodeTypeManagerImpl from a different package, say, org.apache.jackrabbit.j3.SessionImpl, without a public constructor or public method? Maybe there is a way to do that. For you it may even be a proper design, modular or whatever. Like what Jukka just made (adding an indirection class). For me, that's plain confusing, overly complex, and bad (it's security through obscurity). The direct way (having a public constructor) is the best solution. That was just an example. There are many other cases, for example org.apache.jackrabbit.j3.NodeImpl.doAddLock(..). Regards, Thomas
Re: [jr3] Security through obscurity
Hi, If you think it's proper OO and such, could you please provide *one* example of a larger project that does *not* have public implementation methods? Regards, Thomas
Re: [jr3] Security through obscurity
Hi, I completely agree with Justin. "Package-protected" I think it does have its use, but for more complex products it's just not enough. Somewhat related: in Java 1.0.2 you could use private protected: http://www.jguru.com/faq/view.jsp?EID=15576 "Security" For real security, you either need remoting, or a SecurityManager. Regards, Thomas
Re: [jr3] Jackrabbit 3 in trunk
Hi, "objectives of the jr3 project is to deliver better performance than jr2 on scalability, concurrency, latency, etc., it would be helpful to have an automated stress test framework" That's true. There are already a few such test cases, but more are required. Patches are welcome of course :-) However, I fear most people will ignore this prototype unless it is actually usable. That's why I think adding features is important as well. This doesn't mean the prototype needs to pass the TCK, but at least the basic operations should work as expected. "It's easier to fix deep architectural issues before a bunch of code has been written around the architecture, so the priority should be to have code that breaks the architecture (highlighting the weak points) before having code that uses the architecture (highlighting the strong points). In other words, the architecture needs to be correct before adding features." That's true. Probably clustering should be added before versioning, because clustering has a higher impact on the architecture. Regards, Thomas
Re: [jr3] Jackrabbit 3 in trunk
Hi, My suggestion (admittedly as a bystander) would be that the sooner people can start breaking it, the sooner it can get fixed, so prioritize activities based on first getting it to the point of breakability (rather than usability), and then merge. Sorry, I don't understand what you mean exactly... Could you give an example? Regards, Thomas
Re: FYI: Moving session-related classes to o.a.j.core.session
Hi, I'm not sure if this will help more than it will complicate things. Disadvantages: - Isn't almost every class in o.a.j.core at least somewhat session related? - If you move classes to other packages, you will have to make many methods public. Instead of moving session-related classes to a separate package, what about moving unrelated classes to different packages? For example TestContentLoader (test), RepositoryCopier (utilities), SearchManager (search), NodeTypeInstanceHandler (nodetype), RepositoryChecker (persistence), UserPerWorkspaceSecurityManager (security), DefaultSecurityManager (security), ItemValidator (nodetype). Regards, Thomas On Mon, May 17, 2010 at 10:43 AM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, As a part of my work on JCR-890, I'm planning to move most of the session-related classes from o.a.j.core to a new o.a.j.core.session package. This will make it easier to review and control related dependencies and code paths, and to ultimately guard them against access from concurrent threads. As the first step I'm simply moving the relevant classes and making the minor dependency changes where needed, so the functional risk should be low. However, the moves will likely invalidate many other pending jackrabbit-core changes, so please let me know if you have pending changes that I should wait for before I move these classes. Unless there's a need to wait, I'm planning to commit the changes in the afternoon today. BR, Jukka Zitting
Re: FYI: Moving session-related classes to o.a.j.core.session
Hi, "These unrelated classes are mostly things like RepositoryImpl, TransientRepository, RepositoryCopier, etc. to which many external codebases are linking, so we can't move them." SessionImpl is used in my applications as well. "RepositoryImpl, TransientRepository" I don't think those should be, or need to be, moved. Regards, Thomas
Re: FYI: Moving session-related classes to o.a.j.core.session
Hi, As far as I understand, you want to move the classes so we can add checkstyle / PMD constraints, and more easily ensure every method call from an external class is synchronized. I think that's fine. Having the 'proxy' classes sounds like a solution for the backward compatibility concerns (not the perfect solution, but a good solution for Jackrabbit 2). For Jackrabbit 3 I hope people will not directly cast to implementation classes any longer. Regards, Thomas
Re: [jr3] Jackrabbit 3 in trunk
Hi, So far the prototype is not yet usable, meaning too many features are missing, tools are missing, documentation is missing. I guess this needs to be fixed first, so that it becomes somewhat usable (even with limited functionality). We also need to find out how / where exactly we want to add it in the trunk. Regards, Thomas
Re: Moving backwards compatibility tests to trunk
Hi, An alternative is: download the old Jackrabbit jar files when running the tests (download the jar files dynamically when required, for example to the target directory), and then load them using a custom class loader, or create the old repository in a separate process. While this is currently not required, it would be more flexible (it can support very large repositories, and comparing against many versions of Jackrabbit). The same approach can be used by migration tools (migrate a repository from any old version of Jackrabbit to a new version). It's just an idea (I currently don't have plans to implement this myself). But I do have some source code: http://code.google.com/p/h2database/source/browse/trunk/h2/src/tools/org/h2/dev/util/Migrate.java - this standalone class migrates a database from an old version to a new version. Regards, Thomas
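The custom class loader idea can be sketched with a plain URLClassLoader whose parent is null, so that the old Jackrabbit classes stay isolated from the new ones. The class and the jar location in the usage note are hypothetical:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: load an old Jackrabbit version from downloaded jar files in
// an isolated class loader, so old and new classes do not clash.
final class OldVersionLoader {
    static ClassLoader forJars(URL... jars) {
        // parent null: only the bootstrap classes (java.*) are shared;
        // everything else must be resolved from the given jars
        return new URLClassLoader(jars, null);
    }
}
```

Usage would look something like forJars(new URL("file:target/jackrabbit-core-1.6.0.jar")) followed by loadClass and reflective calls into the old repository; the jar path is only an example.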
[jr3] Additional Jackrabbit interfaces in org.apache.jackrabbit.api
There are a few interfaces that might be interesting for all users of Jackrabbit. Those should be in the api package (not only for OSGi). The most important is probably: org.apache.jackrabbit.core.observation.SynchronousEventListener What about 'officially' supporting it, and moving it to org.apache.jackrabbit.api? For example to org.apache.jackrabbit.api.observation.SynchronousEventListener Another related interface that currently doesn't exist is: org.apache.jackrabbit.api.observation.ClusterEvent with a method isExternal() so that you can find out if an event originated from this or another cluster node (because in some cases you only want to handle an event in one cluster node, not in all of them). Maybe we would additionally need ClusterAwareEventListener and ClusterAwareEventJournal (to avoid having to cast). What do you think? Other interfaces? This is mainly for Jackrabbit 3, but we might start supporting it within Jackrabbit 2.x as well if needed. Regards, Thomas
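To make the proposal concrete, the two suggested interfaces could look roughly like this. The names follow the mail; the shape is hypothetical, and in Jackrabbit the event interface would extend javax.jcr.observation.Event (omitted here to keep the sketch self-contained):

```java
// Hypothetical sketch of the proposed org.apache.jackrabbit.api.observation
// interfaces; not actual Jackrabbit code.

interface ClusterEvent /* extends javax.jcr.observation.Event */ {
    // true if the event originated in another cluster node,
    // so that some events can be handled in only one node
    boolean isExternal();
}

interface ClusterAwareEventListener {
    // receives cluster events directly, avoiding the need to cast
    void onClusterEvent(ClusterEvent event);
}
```

A listener that should act only in the originating node would then simply skip events where isExternal() returns true.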
Re: [jr3] Additional Jackrabbit interfaces in org.apache.jackrabbit.api
Hi, org.apache.jackrabbit.api.observation.JackrabbitEvent ? You are right, I didn't see that... sorry... JackrabbitEvent already has isExternal, so forget about ClusterEvent. org.apache.jackrabbit.api.observation.ExtendedEvent I can't find this one. Regards, Thomas
Re: [jr3] Node Identifiers / Corresponding Nodes
Hi, I'm wondering if Jackrabbit 3 should support storage backends that use the path as the identifier. It's probably possible (with some limitations), but I'm not sure if it's necessary. I'm sure it's inefficient, but sometimes that's not a problem. What do others think? If we want to support it, we should decide that early on. Regards, Thomas
Re: [jr3] Node Identifiers / Corresponding Nodes
Hi, I agree, we should concentrate on a few backends. I think there are at least two: - database (what we have now, default) - in-memory (for testing only) Still, I will check what it takes to support path-based node ids. Currently I think it will only take one additional parameter in one method (StorageSession.newNodeId(..., Val relPath)), but I'm not sure. Let's see. Regards, Thomas
[jr3] Node Identifiers / Corresponding Nodes
== Node Identifier Format ==

Jackrabbit node ids are currently UUIDs. For Jackrabbit 3, I think that embedded storage mechanisms should use a long sequence instead. Advantages of sequences: faster to generate (nodeId = nextId++); faster index lookup (nodes generated at around the same time have similar ids, which improves index efficiency); needs less space (especially when using a variable size format; see [1]). Advantage of UUIDs: allows distributed creation of nodes. That's why the Jackrabbit 3 data format should still support UUIDs as node ids: for cloud storage mechanisms.

== JCR Node Identifier versus Internal Unique Node ID ==

The JCR API requires that corresponding nodes of different workspaces have the same JCR identifier. The current Jackrabbit stores each workspace separately, so that's not a problem. With Jackrabbit 3, I would like to combine the storage of all workspaces. The problem is that the JCR node identifier can then no longer be equal to the internal unique node id. For efficient storage, the internal node id should be the combination of the workspace id and the JCR node identifier. One solution is: long internalUniqueNodeId = (workspaceId << 40) + jcrNodeIdentifier. The problem is: node ids in workspaces other than workspace #0 need quite a lot of space when using a variable size format. My proposal is: store the workspace id at the end of the JCR node identifier, using a variable size format (see [1]). I think in most cases there is only 1 workspace (workspace #0). The second important case is fewer than 16 workspaces.
I suggest to support the following 4 cases:

* workspace #0: the node ids end with bit 0: internalUniqueNodeId = (jcrNodeIdentifier << 1)
* workspaces #1-#15: node ids end with the bits 01: (jcrNodeIdentifier << 6) + (workspaceId << 2) + 1
* workspaces #16-#2047: node ids end with the bits 011: (jcrNodeIdentifier << 14) + (workspaceId << 3) + 3
* workspaces #2048-#268'435'455: node ids end with the bits 0111: (jcrNodeIdentifier << 32) + (workspaceId << 4) + 7
* workspaces beyond #268'435'455 are not supported.

What do you think, do those constants make sense?

[1] The variable size int / long formats are used in various open source projects such as Apache Lucene, SQLite, the H2 Database Engine, and Google Protocol Buffers. They are somewhat similar to UTF-8 encoding. See also: http://code.google.com/p/h2database/source/browse/trunk/h2/src/main/org/h2/store/Data.java#989 http://en.wikipedia.org/wiki/Golomb_coding

== Node Without ID ==

The Jackrabbit 3 data format should support storing nodes embedded within the parent node. The advantage: such embedded nodes would be stored next to each other, possibly improving read performance, and maybe reducing storage space (both need to be tested). The identifier of such embedded nodes would be unique, but not stable. Regards, Thomas
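The four cases from the list above can be sketched as encode / decode functions. This is illustrative only; the real format would additionally apply the variable size long encoding from [1], while plain longs are used here:

```java
// Sketch of the proposed internal node id layout: the workspace id and a
// short tag are stored in the low bits, the JCR node identifier above them.
final class InternalNodeId {

    static long encode(long jcrNodeId, long workspaceId) {
        if (workspaceId == 0) {
            return jcrNodeId << 1;                              // ends with 0
        } else if (workspaceId <= 15) {
            return (jcrNodeId << 6) + (workspaceId << 2) + 1;   // ends with 01
        } else if (workspaceId <= 2047) {
            return (jcrNodeId << 14) + (workspaceId << 3) + 3;  // ends with 011
        } else if (workspaceId <= 268_435_455) {
            return (jcrNodeId << 32) + (workspaceId << 4) + 7;  // ends with 0111
        }
        throw new IllegalArgumentException("workspaceId too large: " + workspaceId);
    }

    static long workspaceId(long id) {
        if ((id & 1) == 0) return 0;                  // tag 0
        if ((id & 3) == 1) return (id >>> 2) & 0xf;   // tag 01, 4-bit workspace id
        if ((id & 7) == 3) return (id >>> 3) & 0x7ff; // tag 011, 11-bit workspace id
        return (id >>> 4) & 0xfffffff;                // tag 0111, 28-bit workspace id
    }

    static long jcrNodeId(long id) {
        if ((id & 1) == 0) return id >>> 1;
        if ((id & 3) == 1) return id >>> 6;
        if ((id & 7) == 3) return id >>> 14;
        return id >>> 32;
    }
}
```

Since most repositories have only workspace #0, most ids pay only one extra bit, which is the point of the scheme.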
Re: Jackrabbit 1.6.0 Write Performance
Hi, "- The jackrabbit repository is accessed from our app using RMI." Can you use the repository in embedded mode? That would help a lot. "embedded Derby database" "We've tested using postgres" I would test the H2 database if you have time. Regards, Thomas
Re: Jackrabbit 1.6.0 Write Performance
Hi, With regard to concurrency, are there any plans for jackrabbit to support concurrency out of the box? If you use one session for each thread then it should already work. It's a bug if it doesn't. In any case I would use one session per thread, no matter if a future version of Jackrabbit supports it or not. Regards, Thomas
Re: clustered environment, 2 different jvms, TransientFileFactory, storing file blobs in db
Hi, Stefan is right, File.createTempFile() doesn't generate colliding files. However, there is a potential problem with the TransientFileFactory. Consider the following case:

- The file bin-1.tmp is created (BLOBInTempFile line 51).
- The TransientFileFactory adds a PhantomReference A to its queue.
- BLOBInTempFile.delete() or dispose() is called, and the file bin-1.tmp is deleted.
- A new file, also called bin-1.tmp, is created (BLOBInTempFile line 51). That's possible because File.createTempFile can re-use file names.
- The TransientFileFactory adds a second PhantomReference B to its queue, pointing to a different file with the same name.
- The first (only the first) BLOBInTempFile is no longer referenced.
- The TransientFileFactory.ReaperThread gets PhantomReference A and deletes the file. But the file is still used and referenced (B).

I'm not sure if this is what is happening in your case, but it is a potential problem. Could you log a bug? There are multiple ways to solve the problem. I think the best solution is to not use File.createTempFile() and instead use our own file name factory (with a random part and a counter part). Regards, Thomas
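The suggested file name factory (a random part plus a counter part) could look roughly like this. Illustrative only, not actual Jackrabbit code; because the counter never repeats a name within a factory, a deleted name is never handed out again, which avoids the phantom reference mix-up described above:

```java
import java.io.File;
import java.io.IOException;
import java.security.SecureRandom;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: generate temp file names ourselves instead of using
// File.createTempFile(), so file names are never re-used.
final class UniqueTempFileFactory {
    // random part: avoids collisions with files left by other processes
    private final String randomPart =
            Long.toHexString(new SecureRandom().nextLong());
    // counter part: guarantees uniqueness within this factory
    private final AtomicLong counter = new AtomicLong();
    private final File directory;

    UniqueTempFileFactory(File directory) {
        this.directory = directory;
    }

    File newTempFile() throws IOException {
        File f = new File(directory,
                "bin-" + randomPart + "-" + counter.incrementAndGet() + ".tmp");
        if (!f.createNewFile()) {
            throw new IOException("File already exists: " + f);
        }
        return f;
    }
}
```

With such a factory, the reaper thread can safely delete the file behind a dequeued phantom reference, because no live reference can point to a newer file with the same name.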
Re: [jr3] MicroKernel prototype
Hi, "it's too early IMO to judge whether a caching hierarchy manager is needed or not... IMO the only statement that can be made based on your comparison is that if the prototype with very limited functionality were slower than jackrabbit with a fully implemented feature set, the prototype's architecture would probably need to be reconsidered ;)" I agree. "- security - locking - scalability (number of concurrent sessions and repository size) - transactions" OK, I will then try to implement (prototype) those features now. "very flat hierarchies" Yes. We do want to solve that, it will affect the architecture, and we don't have much experience yet with how to best solve it. So I guess it's also one of the features that should be implemented early. Regards, Thomas
Re: [jr3] MicroKernel prototype
Hi, "i doubt that the results of this comparison is any way significant." It was not supposed to be a fair comparison :-) Of course the prototype doesn't implement all features. For example, paths are parsed in a very simplistic way. I don't think the end result will be as fast as the prototype. Still, I do hope that the missing features will not slow down the code significantly if they are not used. And if they are used, the penalty shouldn't be too high. What is significant: the prototype is not slower than the full Jackrabbit, even without the CachingHierarchyManager. For me that's relatively important, because it would simplify the architecture. More tests are required to check if the current architecture works well even if there are millions of nodes and many concurrent sessions. And it's important to add more features of course. I'm wondering which are the *most* problematic features to verify the architecture:

- security
- orderable child nodes
- same name siblings
- locking
- transactions
- clustering
- observation
- workspaces
- node types
- large number of child nodes
- search
- correct path parsing and lookup
- multiple sessions

"cut some features to gain performance improvement." I'm not sure. What features could be cut? Regards, Thomas
Re: [jr3] MicroKernel prototype
Hi, I have some early performance test results: There is a test with 3 levels of child nodes (each node 20 children) (TestSimple.createReadNodes). With the JDBC storage and the H2 database, this is about 14 times faster than the Jackrabbit 2.0 trunk (0.2 seconds versus 2.9 seconds for Jackrabbit 2.0). This is after 3 test runs. The storage space usage is about 1/3 (2.8 MB for the prototype versus 9.5 MB for Jackrabbit 2.0). Regards, Thomas
[jr3] Store journal as nodes
Currently the journal (cluster journal and event journal) is stored using a separate storage mechanism. I think it should be stored using the 'normal' storage mechanism. Advantages: - Simplifies the architecture (especially for clustering) - Events and node data are in the same transaction, which improves reliability and performance Regards, Thomas
Re: [jr3] Store journal as nodes
Hi, (except logging Yes, I think SLF4J is fine and configuration, probably Some information needs to be available when the repository is constructed, or at the latest when logging in: what storage backend to use, and how to connect to the storage backend. The rest of the configuration (fulltext index configuration for example, workspace names, security, data store configuration, cluster configuration, node type registry) should be in the repository (as system nodes) in the normal case. This is to simplify the system and to make configuration changes transactional. There may be a way to override that (for example when constructing the repository object), but that should be the exception. I think it doesn't make sense to keep the XML configuration files. What do you mean by 'normal' storage mechanism? I mean the data should be stored in the same place as the node data. Unless we find it is a performance problem, I would try to store the events as node bundles of some kind (possibly multiple events plus regular nodes in the same bundle). For the micro kernel it could look exactly like a normal node. Is it nodes and properties? In that case I fear further performance issues in this area. If it does turn out to be a performance problem, we will change it of course. Regards, Thomas
Re: [jr3] Store journal as nodes
Hi, In case of the cluster db journal, the hostname of the db connection. The hostname of the database (if a database is used) and the database name need to be known when creating the repository object. Storing it in a 'repository.xml' file is possible, but it's just an unnecessary indirection. If you keep this information in the repository.xml file, where do you store the path of the repository.xml file? If the user name and password need to be protected (not stored as plain text), how do you do that? Using yet another indirection (JNDI)? I suggest to pass the database URL (or whatever storage you use) when creating the repository object. Example (using a helper method; just an example): RepositoryFactoryImpl.openRepository("jdbc:postgresql:repo", "user", "password"); If you want to use a repository.xml file (that only contains the database connection information) you can of course. But do you really need an XML file for the database URL, the user name, and the password? Especially if the user name and password are things that normally should not be stored in a file? Speaking about databases: do you know of a database where you need to store the location of the database files in an XML file? I guess there are some databases where you *can* do that, but I don't know any where you *have to*. Configuration should be editable without booting the repository. Why? Again, for a db store, if the db host changes after repository shutdown, we should be able to configure the repository to use a different db host. Like we can change in the current repository.xml. The current repository.xml file contains much more than just the database connection settings. It contains the search index configuration (or at least part of it), file system configuration, cluster configuration, data store configuration, security configuration, workspace configuration (and for some, the version store), etc. All that, except for the database connection settings, can be stored in the repository itself. Because it simplifies things.
It's a feature of some application servers to manage cluster configurations. I don't see a problem here. They can. I would prefer to leave the complexity out of the default standalone deployment. I like to keep things as simple as possible. The repository.xml and workspace.xml files are not required; they actually make things more complicated than necessary (especially, but not only, when clustering is used). Regards, Thomas
Re: [jr3] Synchronized sessions
Hi, consistency. I don't know of a relational database that allows you to violate referential integrity, unique key constraints, or check constraints - simply by using the same connection in multiple threads. A JCR repository should have some point to do the constraint checks as well. It should fail the operation if a conflict is found. The easiest way to achieve internal integrity is to synchronize on an object. Synchronizing on the session object is much easier, and costs much less, than 1) allowing internal state to get corrupted, 2) but then somehow detecting the corruption, 3) and then trying to fix such problems later on. Performance is not the major concern, it's the design. For me, performance _is_ a major concern. But reliability is more important. Synchronisation should be limited and should be applied at a low level where necessary. There is an overhead for each synchronize. If you synchronize on a very low level, the cost is potentially higher than if you synchronize on a higher level, because you have to synchronize more. Please note scalability doesn't apply in this context: if you want to do stuff concurrently, then use multiple sessions. instead of blindly on the session for everything. I don't suggest to synchronize blindly. I suggest to synchronize with open eyes :-) Syncing on the session level could increase deadlocks as well. No, the opposite: if every method is synchronized on the same object, it will decrease the probability of deadlocks. Regards, Thomas
Re: [jr3] support transaction log
Hi, It may slow down writes around 50%. I think it should be an optional feature (some storage backends may not support it at all, and there should be a way to disable / enable it for those backends that can support it). I think we should support writing our own transaction log even when using relational databases, but I guess it should be possible to switch that off. Regards, Thomas
Re: [jr3] Micro-kernel vs. new persistence
Hi, I think the persistence / storage API should be generic enough to support at least 3 different implementations efficiently: - an implementation based on a relational database - a file based implementation - in-memory I think the storage API should support some kind of storage session (normally one storage session for each JCR session). For a relational database, such a session could map to a database connection. In my view the persistence should be based on bundles (node with all property values and with the list of child nodes) as it is now; maybe there should be a way to combine multiple bundles into one. Probably there should be a way to persist the transient space (only when it doesn't fit in memory). Otherwise we would need to implement a separate mechanism for that (using temporary files). I think the data store API can be used as it is (maybe we can simplify it a bit). Regards, Thomas
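A rough sketch of what such a storage API could look like, with one storage session per JCR session. All names here (Storage, StorageSession, MemoryStorage) are hypothetical, not existing Jackrabbit interfaces; the in-memory backend is one of the three implementations mentioned above, and a database backend would map openSession() to a database connection:

```java
import java.util.HashMap;
import java.util.Map;

// One storage session per JCR session.
interface StorageSession {
    byte[] readBundle(String nodeId);             // returns null if missing
    void writeBundle(String nodeId, byte[] data); // store a node bundle
    void close();
}

interface Storage {
    StorageSession openSession();                 // e.g. maps to a JDBC connection
}

// In-memory implementation, the simplest of the three suggested backends.
class MemoryStorage implements Storage {
    private final Map<String, byte[]> bundles = new HashMap<>();

    public StorageSession openSession() {
        return new StorageSession() {
            public byte[] readBundle(String nodeId) {
                synchronized (bundles) { return bundles.get(nodeId); }
            }
            public void writeBundle(String nodeId, byte[] data) {
                synchronized (bundles) { bundles.put(nodeId, data); }
            }
            public void close() { }
        };
    }
}
```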
Re: [jr3] Delayed Repository Initialization
Hi, I am not clear what credentials you are referring to I refer to the database user name and password that are currently stored in the repository.xml (except when using JNDI): http://jackrabbit.apache.org/api/1.5/org/apache/jackrabbit/core/persistence/bundle/BundleDbPersistenceManager.html <param name="user" value=""/> <param name="password" value=""/> and how current jackrabbit works with backend login Currently, Jackrabbit requires to be able to create a database connection when initializing. If it's related to the storage backend, it needs to always be stored on the repository level. It depends on what you mean with repository level. It doesn't make sense to store the user name and password of the database inside the database (I hope you agree :-) I would like to make repository.xml optional. To do that, the user name and password for the database need to be stored somewhere else. One solution is to provide them when creating the repository object. Example: String factoryClass = ...; String url = "...?user=sa&password=xyz"; RepositoryFactory factory = (RepositoryFactory) Class.forName(factoryClass).newInstance(); Map<String, String> parameters = new HashMap<String, String>(); parameters.put("url", url); Repository rep = factory.getRepository(parameters); In this case the user name and password are included in the repository URL. This solution is almost what we have now (except there is no repository.xml). What I propose is: Jackrabbit should support the following use case as well: String factoryClass = ...; String url = ...; RepositoryFactory factory = (RepositoryFactory) Class.forName(factoryClass).newInstance(); Map<String, String> parameters = new HashMap<String, String>(); parameters.put("url", url); Repository rep = factory.getRepository(parameters); Session session = rep.login(new SimpleCredentials("sa", "xyz".toCharArray())); Here, the user name and password of the storage backend (for example a relational database) are not included in the repository URL.
Instead, they are supplied in the first session that logs into the repository. Currently this use case is not supported. I suggest that Jackrabbit 3 support this as a possible use case (not necessarily as the default use case). Unless we designed to map jcr session user to jdbc user. Not necessarily. The Delayed Repository Initialization is not related to how Jackrabbit works internally. Jackrabbit might still use only one JDBC connection for the whole repository. Or it might use a JDBC connection pool. Or it might use one JDBC connection per session. Regards, Thomas
Re: [jr3] Delayed Repository Initialization
Hi, Currently Jackrabbit doesn't support delayed initialization. Unless I misunderstood Felix, he would also like to get rid of this restriction. Just to clarify: my suggestion is *not* about requiring that the repository is initialized when the first session is opened. It's also *not* about requiring that the JCR credentials are used to log in to the backend storage (in most cases that's not a good idea). This idea is about *allowing* delayed repository initialization. The examples I gave are just for illustration and show *one* possible use case. Couldn't this be done by a special wrapping Repository implementation? That's problematic. Such a wrapper would have quite some overhead. The JCR API is not easily wrappable if you want to do it correctly: you would have to wrap almost every JCR interface and method, including Node and Property. That would be a relatively large memory overhead. You could use the Java proxy mechanism, but that is relatively slow (uses reflection). Regards, Thomas
[jr3] Delayed Repository Initialization
Currently Jackrabbit initializes the repository storage (persistence manager) when creating the repository object. If the repository data is stored in a relational database, then the database connection is opened at that time. I suggest to allow delayed initialization (allow, not require). For some storage backends, the repository could initialize when opening the first session. Example: 1) String url = "jdbc:..."; RepositoryFactory factory = (RepositoryFactory) Class.forName(factoryClass).newInstance(); Map<String, String> parameters = new HashMap<String, String>(); parameters.put("url", url); Repository rep = factory.getRepository(parameters); 2) String user = ..., password = ...; Session session = rep.login(new SimpleCredentials(user, password.toCharArray())); This example uses a relational database as the storage. When creating the repository object, user name and password are unknown, so the repository cannot initialize at that time. Only when the first user logs in are the user name and password known. In this case, the user name and password of the session would match the user name and password of the storage backend, but that's actually not a requirement (it's just an example). The current Jackrabbit architecture doesn't support this 'delayed initialization' use case yet. I suggest that Jackrabbit 3 should support such delayed initialization. Whether or not we will implement storage backends that actually do use this mechanism is another question. Regards, Thomas
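The delayed-initialization idea above can be sketched as follows. DelayedRepository, Backend, and Connector are made-up names for illustration only; the real code would implement javax.jcr.RepositoryFactory / Repository instead:

```java
// Made-up names (DelayedRepository, Backend, Connector) - a sketch of
// "allow, not require" delayed initialization, not real Jackrabbit API.
class DelayedRepository {
    interface Backend { /* an initialized storage connection */ }
    interface Connector { Backend connect(String user, String password); }

    private final Connector connector;
    private Backend backend;           // null until the first login

    DelayedRepository(Connector connector) {
        this.connector = connector;    // no backend access happens here
    }

    // The storage backend is initialized lazily, on the first session,
    // using the credentials supplied by that session.
    synchronized String login(String user, String password) {
        if (backend == null) {
            backend = connector.connect(user, password);
        }
        return "session for " + user;
    }

    synchronized boolean isInitialized() {
        return backend != null;
    }
}
```

A backend that needs its credentials at construction time would simply ignore this path and connect eagerly, which is the "allow, not require" part of the proposal.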
[jr3] Exception Handling
For Jackrabbit 3, I would like to improve exception handling. Some ideas: == Use Error Codes == Currently exception messages are hardcoded (in English). When using error codes, exception messages could be translated. I'm not saying we should translate them ourselves, but if somebody wants to, he could. Disadvantage: it's more work to maintain, especially if Jackrabbit is split into multiple projects. Every project could have its own message list, or the list could be centralized. I'm not sure if it's worth it. What do you think? == Include the Jackrabbit Version in Exceptions == This is mainly to simplify support: it's very easy to say what version was used when somebody posts an exception message. Example: Repository is closed [1000-3.0.1003] - this would mean error code 1000, Jackrabbit version 3.0, build 1003. The build number alone would be enough, but for the user it may be better to also include the version. Also, it will allow looking at the source code without having to download the source code of the correct version, even without having to install an IDE. I wrote a simple JavaScript application: http://www.h2database.com/html/sourceError.html - if you paste an exception in the 'Input' text area, it will link to the source code and display additional information. The source code is in an IFrame that links to the right tag in the source repository. For example, if you paste the following exception: Syntax error in SQL statement SELECT * FORM[*] TEST [42000-130] at org.h2.message.DbException.getJdbcSQLException(DbException.java:317) at org.h2.message.DbException.get(DbException.java:168) at org.h2.message.DbException.get(DbException.java:145) at org.h2.message.DbException.getSyntaxError(DbException.java:180) at org.h2.command.Parser.getSyntaxError(Parser.java:475) You will be able to browse the source code in the Source Code frame. If Jackrabbit is split into multiple projects, there would be multiple versions.
There are solutions for this, but as a start, it's easier to just use this mechanism in one project only (Jackrabbit Core). Regards, Thomas
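A minimal sketch of how such messages could be built and parsed; the ErrorCode class name and method signatures are assumptions, only the [code-version.build] format comes from the example above:

```java
// Sketch: append "[code-version.build]" to every exception message,
// as in "Repository is closed [1000-3.0.1003]".
class ErrorCode {
    // Build the full message from its parts.
    static String format(String message, int code, String version, int build) {
        return message + " [" + code + "-" + version + "." + build + "]";
    }

    // Extract the numeric error code back out of such a message.
    static int parseCode(String message) {
        int open = message.lastIndexOf('[');
        int dash = message.indexOf('-', open);
        return Integer.parseInt(message.substring(open + 1, dash));
    }
}
```

A support tool (like the JavaScript page mentioned above) only needs parseCode plus the version/build suffix to link a pasted stack trace to the right source tag.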
Re: [jr3] Delayed Repository Initialization
Hi, I would prefer to initialise the repository in the first place and make sure everything is correct for the repository I wrote: *allow* delayed initialization (allow, not require). If users want to delay the initialisation, they may create the repository reference only when first accessed. If the credentials are included in the repository configuration (currently they always are; they have to be) then it's of course possible to initialize when the repository object is created. The question is: should Jackrabbit 3 *require* (like now) that the credentials for the storage are included in the repository configuration? I think for some storage backends it should not require that. Instead (only in those cases), it should initialize the repository when the first session is opened. Regards, Thomas
Re: [jr3] Synchronized sessions
Hi, A JDBC connection is not thread safe. A JCR session works in a similar way and I prefer to follow the same pattern. Me too. But there is a difference between thread safety and consistency. I don't know of a relational database that allows you to violate referential integrity, unique key constraints, or check constraints - simply by using the same connection in multiple threads. See also http://en.wikipedia.org/wiki/ACID Jackrabbit did and does have such problems (nodes that point to non-existing parent nodes; nodes that point to non-existing child nodes). *Those* are the problems I want to solve. Jackrabbit shouldn't try to protect an application from storing the wrong data. It can't. Application developers are responsible for ensuring application level consistency (this sentence stolen from Wikipedia). To what avail? It should never be necessary to run a consistency check or consistency fix. It should never be necessary to delete nodes because they are corrupt. Nodes should never get corrupt. programmers ... If they do not, it is their fault and they have to live with the consequences of their doing the wrong thing. Unfortunately, it's not that simple to find out whose program caused the problem. Usually, the people who have to fix the problem are not the ones who caused it. But not with synchronizing all methods. As I already wrote, if this does turn out to be a performance problem, we can remove synchronization where required. Regards, Thomas
Re: [jr3] EventJournal / who merges changes
Hi, Multiple threads adding child nodes to the same parent node Yes, that's an important use case, and should not be a problem for my proposed solution. For instance, more than 1 thread calling UserManager.createUser(userId, shardPath(userId)) where shardPath(userId) results in a subtree generated from the userId to reduce contention If we support large child node lists (automatically split using hidden inner nodes) then your application would get simpler. child nodes are essentially multivalue properties You are right, internally child nodes are currently stored in a similar way. Regards, Thomas
Re: [jr3] EventJournal / who merges changes
Hi Ian, Could you describe your use case? probability of conflict when updating a multivalued property is reduced What methods do you call, and how should the conflict be resolved? Example: if you currently use the following code: 1) session1.getNode("test").setProperty("multi", new String[]{"a", "b"}, ...); 2) session2.getNode("test").setProperty("multi", new String[]{"d", "e"}, ...); 3) session1.save(); 4) session2.save(); Then that would be a conflict. How would you resolve it? One option is to silently overwrite in line 4. Regards, Thomas
Re: [jr3] EventJournal / who merges changes
There are low level merges and high level merges. A low level merge is problematic: it can result in unexpected behavior. I would even say the way Jackrabbit merges changes currently (by looking at the data itself, not at the operations) is problematic. Example: Currently, orderBefore can not be done at the same time as addNode or another orderBefore. I'm not saying this is important, but it's one case that is complex. Another example: Let's say the low level representation would split nodes if there are more than 1000 child nodes (add one layer of hidden internal nodes). That means adding a node to a list of 1000 nodes could cause a (b-tree-) split. If two sessions do that concurrently it will get messy. Session 1 will create new internal nodes, session 2 will create new internal nodes as well (but different ones), and merging the result will (probably) duplicate all 1000 nodes. Or worse. The idea is to _not_ try to merge by looking at the data, but merge by re-applying the operation. If saving the new data fails (by looking at the timestamp/version numbers), then refresh the data, and re-apply the operation (orderBefore, addNode, ...). This is relatively easy to implement, and works in more cases than what Jackrabbit can do now. Jackrabbit needs to keep the EventJournal anyway, so this will not use more memory. This is not a new idea, it is how MVCC works (at least how I understand it). From http://en.wikipedia.org/wiki/Multiversion_concurrency_control - if a transaction [fails], the transaction ... is aborted and restarted. Regards, Thomas
Re: [jr3] Synchronized sessions
Hi, this creates a big potential for deadlocks Could you provide an example of how such a deadlock could occur? just synchronizing all methods So you also synchronize all Node/Item/Property methods Some methods don't need to be synchronized, for example some getter methods such as Session.getRepository(), RangeIterator.getPosition() and getSize(). I'm not sure if Node.getProperty needs to be synchronized. The Value class is (almost) immutable, so synchronization is not required here. But very likely Session.getNode(..) and Node.getNode() need to be synchronized because those potentially modify the cache. ensure that for a given Item x, the same Item instance is always returned from all getXXX methods I'm not sure what you are referring to. Jackrabbit already does ensure the same node object is returned as far as I know, but for other reasons than synchronization. if people do the wrong things, well, fine, let them do ... It's usually not those people that have to fix broken repositories. my veto Let's see. Most JCR apps I've seen often use a single session from several threads to read from this session. (I think I also read it somewhere that this is safe with Jackrabbit, but I might be mistaken). I'm not sure if this is really safe. Maybe it is problematic if one thread uses the same session for updates. Simply syncing everything on the session would decrease performance in these cases dramatically. Actually, I don't think that's the case. Regards, Thomas
Re: [jr3] Synchronized sessions
Hi, Consider two or more threads reading different items at the same time: they all are chained one after the other. Only if those threads use the same session. this is unsupported, yet you want to add synchronization to secure this unsupported case ... When we are done it becomes a supported use case :-) I don't have an example off hand Please let us know if you have one. Regards, Thomas
Re: [jr3] Synchronized sessions
Hi, http://issues.apache.org/jira/browse/JCR-2443. Unfortunately this bug doesn't have a test case. Also, I didn't find a thread dump that shows what the problem was exactly, so I can't say what the problem was there. Observation is definitely an area where synchronization can potentially lead to deadlocks. Maybe observation needs to use its own session(s) so that it can't block. This is not a new issue however: most writes are already synchronized (not all writes however). I'm hesitant to change synchronization with the current implementation: doing that would very likely lead to Java level deadlocks. We need to make sure synchronization is always done on the same level, and in the same order. With the current implementation, that's challenging. Of course performance and concurrency are very important. But the current approach (mutable data structures, some writes synchronized) is quite dangerous. Instead, immutable data structures should be used, at least for values and objects in the shared cache. Everything else should be properly synchronized if mutable, or - if that's too slow - the proper data structures should be used, for example ConcurrentHashMap, CopyOnWriteArrayList, CopyOnWriteArraySet. Regards, Thomas
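As a small illustration of the last point, a listener registry based on CopyOnWriteArrayList: firing events iterates over an immutable snapshot, so delivery never has to synchronize on the session and cannot deadlock against concurrent registration (class and method names are invented for this sketch):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Sketch: observation listener registry on a concurrent data structure
// instead of synchronized mutable state.
class ListenerRegistry {
    // Reads are lock-free; each add() copies the underlying array.
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    void add(Consumer<String> listener) {
        listeners.add(listener);
    }

    void fire(String event) {
        // Iterates over a snapshot: a concurrent add() can neither throw
        // ConcurrentModificationException nor block this loop.
        for (Consumer<String> l : listeners) {
            l.accept(event);
        }
    }
}
```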
[jr3] EventJournal / who merges changes
== Current Behavior == Currently Jackrabbit tries to merge changes when two sessions add/change/remove different properties concurrently on the same node. As far as I understand, Jackrabbit merges changes by looking at the data (baseline, currently stored, and new). The same for child nodes: when two sessions add different child nodes concurrently, both child nodes are added. There are some problems, for example (when using b-tree mechanisms for child nodes) when a session added child nodes that caused the child node list to split, and a second session adds a different child node (possibly causing a different split). For the second session it looks like some child nodes have been removed, and it would add the child node on the wrong (b-tree) level (in an inner node instead of in the leaf node). I think merging changes is problematic. Trying to derive the logical operation from diffing the old and new versions is sometimes very hard. I suggest to merge changes in a different way. == Proposed Solution == When adding/changing/removing a property or node, the logical operation should be recorded on a high level (this node was added, this node was moved from here to there, this property was added), first in memory, but when there are many changes, it needs to be persisted (possibly only temporarily). When committing a transaction (usually Session.save()), the micro-kernel tries to apply the changes. If there was a conflict, the micro-kernel rejects the changes (it doesn't try to merge). The higher level then has to deal with that. One way to deal with conflict resolution is: 1) Reload the current persistent state (undo all changes, load the new data). 2) Replay the logical operations from the (in-memory or persisted) journal. 3) If that fails again, depending on a timeout, go to 1) or fail.
What I describe here is how I understand MVCC http://en.wikipedia.org/wiki/Multiversion_concurrency_control - every object would also have a read timestamp, and if a transaction Ti wanted to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. So Jackrabbit would record the 'transaction Ti' on a higher level. If applying the changes fails (in the micro-kernel), Jackrabbit would automatically restart this transaction (up to a timeout). This should also work well in a distributed environment. This case is similar to synchronizing databases. == API == Instead of the current API that requires the change log to be in memory, I suggest to use iterators: void store(Iterator<Bundle> newBundles, Iterator<Event> events) throws ConcurrentUpdateException The ChangeLog consists of the new node bundles (plus, for each node bundle, the read timestamp). The event list consists of the EventJournal entries. For smaller operations, a session can keep the event journal in memory. For larger operations, the session can use a temporary file, or possibly store the data in a temporary area within the persistence layer (maybe using a different API). If the operation fails, the session would reload all bundles, and re-apply the events stored in its own local event log. Regards, Thomas
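The reload-and-replay loop described in steps 1-3 can be sketched like this. MicroKernel, ReplaySession, and the single revision-number conflict check are simplified stand-ins for the real per-bundle timestamp/version mechanism:

```java
import java.util.List;

// Stand-in kernel: a commit succeeds only if the caller's base revision
// is still the head revision (optimistic version check, no merging).
class MicroKernel {
    private long head;
    synchronized long headRevision() { return head; }
    synchronized boolean commit(long baseRevision) {
        if (baseRevision != head) {
            return false;              // conflict: reject the changes
        }
        head++;
        return true;
    }
}

class ReplaySession {
    private final MicroKernel kernel;
    private long base;
    int replays;                       // number of conflict-triggered replays

    ReplaySession(MicroKernel kernel) {
        this.kernel = kernel;
        this.base = kernel.headRevision();
    }

    // save(): apply the journal, try to commit; on conflict reload the
    // persistent state (step 1) and replay the journal (step 2), up to
    // a retry limit instead of a timeout (step 3).
    void save(List<Runnable> journal, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            for (Runnable op : journal) {
                op.run();              // (re-)apply the logical operations
            }
            if (kernel.commit(base)) {
                base = kernel.headRevision();
                return;
            }
            replays++;
            base = kernel.headRevision();   // reload, then try again
        }
        throw new IllegalStateException("too many conflicts");
    }
}
```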
[jr3] Synchronized sessions
Currently, Jackrabbit sessions are somewhat synchronized, but not completely (for example, it's still possible to concurrently read and update data). There were some problems because of that, and probably there still are. I believe it's better to synchronize all methods in the session (on the session object). This includes methods on nodes and properties and so on. If this does turn out to be a performance problem, we can remove synchronization where required (and where it can safely be removed) or change the implementation (use immutable objects or safe data structures). This is more conservative, but I think the impact on performance will be minimal. Of course performance is important, however I think data consistency is more important than the possible gain of a few percent of (read-) performance. Regards, Thomas
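A tiny sketch of the proposal, assuming a hypothetical SyncSession class: every method, including those reached through node and property objects, synchronizes on the one session instance, so there is a single lock and no lock-ordering problem:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical session: ALL methods synchronize on the session object
// itself - even reads, because a read may fill the internal cache.
class SyncSession {
    private final Map<String, String> cache = new HashMap<>();

    String getNode(String path) {
        synchronized (this) {
            return cache.get(path);
        }
    }

    void setNode(String path, String value) {
        synchronized (this) {
            cache.put(path, value);
        }
    }
}
```

Node and property objects belonging to this session would take the same monitor, which is why deadlocks become less likely rather than more.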
Re: [jr3] Synchronized sessions
Hi, deadlocks I think it's relatively simple to synchronize all methods on the session. If we want to make sessions thread-safe, we should use proper implementations. Yes, that's what I want to write: a proper implementation. any concurrent use of the same session is unsupported. The disadvantage of this is that there is no way to enforce correct usage. In some cases, incorrect usage leads to data corruption. I believe data corruption is not acceptable, even if the user made a mistake. Regards, Thomas
[jr3] Bundle format
I would like to define a new storage format for nodes and properties. A few ideas: == Name and Namespace Index == Currently each new property and node name is stored in the name index. Each namespace is stored in the namespace index. Those indexes are used to compress the data. There are several (smaller) problems with this: - The indexes are stored in a properties file (non-transactional). - In the past, there were a few problems when migrating data (copying workspaces). - Jackrabbit indexes *each* name and namespace. This can run out of memory if there are many names (dynamically created names). - This is a problem for clustering (especially when using the eventually consistent model). I would like to keep a name index mechanism for commonly used names and namespaces, but would also support a non-indexed name / namespace format. I think we should start with a fixed list. We could add a mechanism to create new index entries later on. == Node Id == Currently Jackrabbit uses UUIDs to identify nodes. Even nodes that are not referenceable have UUIDs. This allows to create nodes concurrently, which is good. It is not optimal for storage however (index cache efficiency is very bad because the numbers are random; size overhead). Also, it's quite inflexible (it's hard to refer to external nodes). For node id storage, I suggest to support multiple data types: UUID (which is basically a fixed-length value or a string), long, and string. The Jackrabbit implementation may not need to support all formats (at least at first), but the (bundle) storage format should. == List of Parent Node Ids == I would store that as a (hidden) multi-value property. == Commonly Used Strings == If we want to store node types as regular properties, we should avoid storing the node type strings. Instead, we should store the node type index only. This is similar to the name index and namespace index. I suggest the storage format supports a set of indexed values (initially a fixed list). Regards, Thomas
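The hybrid scheme (fixed index for common names, literal fallback for dynamically created names) could look roughly like this. The list of common names and the '#'/'!' prefix encoding are invented for illustration; a real format would use a binary tag byte:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: common names compress to a small fixed index; everything else
// is stored literally, so the index can never grow without bound.
class NameCodec {
    // A made-up fixed list; the real list would be chosen up front.
    private static final List<String> COMMON =
        Arrays.asList("jcr:primaryType", "jcr:created", "jcr:uuid");

    static String encode(String name) {
        int i = COMMON.indexOf(name);
        return i >= 0 ? "#" + i : "!" + name;   // "#<index>" or "!<literal>"
    }

    static String decode(String encoded) {
        return encoded.charAt(0) == '#'
            ? COMMON.get(Integer.parseInt(encoded.substring(1)))
            : encoded.substring(1);
    }
}
```

Because the fixed list is the same on every cluster node, this encoding needs no coordination, which addresses the eventually-consistent clustering concern above.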
Re: [jr3] Repository microkernel
Hi, I agree with Marcel, the current Jackrabbit SPI is too high level. it must be impossible to create inconsistent data using the micro-kernel API - tree and referential integrity awareness +1. No more consistency check / fix. No more inconsistent index if the property-value index is also part of the micro-kernel. - long running transaction support I think 'large' transaction support is important, but it doesn't need to be very fast. I would avoid creating temporary files (persisting the cache). Instead, changes could be written to storage, but not committed. A few ideas (but I'm not convinced all of them are good ideas): - In memory, each node points to its (main) parent. If a node is in memory, then its parent is also. - Nodes are immutable (in memory, and on disk). Each change (or at least commit) will replace all parent nodes including the root node. - No change merging. Instead, use MVCC (when reading) and 'node level write locking'. As far as I know this is how most MVCC databases work. Actually we could use 'property level write locking'. - Support multiple persistence backends (database, file system, ...). - Support two phase commit and distributed transactions at a very low level, so that it's very easy to distribute data to many storage backends. - Move operations are copy+delete internally (maybe reordering a node in the list of child nodes is also a move). - Subtrees that didn't change for a longer time are eventually persisted as one blob in the data store, in a form that is compact and fast to read. Regards, Thomas
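The second bullet ('nodes are immutable, each commit replaces all parent nodes including the root') is essentially path copying in a persistent tree; a minimal sketch, with a made-up Node class:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Immutable node: a change produces a new copy of every node on the
// path from the changed node up to the root; old roots stay valid
// snapshots for concurrent readers (the MVCC part).
final class Node {
    final String name;
    final Map<String, Node> children;

    Node(String name, Map<String, Node> children) {
        this.name = name;
        this.children = Collections.unmodifiableMap(new HashMap<>(children));
    }

    // Returns a NEW root with 'child' added under path[depth..];
    // the receiver and everything below it are left untouched.
    Node withChild(String[] path, int depth, Node child) {
        Map<String, Node> copy = new HashMap<>(children);
        if (depth == path.length) {
            copy.put(child.name, child);                  // reached the parent
        } else {
            Node next = children.get(path[depth]);        // copy along the path
            copy.put(path[depth], next.withChild(path, depth + 1, child));
        }
        return new Node(name, copy);
    }
}
```

Only the nodes on the changed path are copied; unchanged subtrees are shared between the old and the new root, which keeps commits cheap.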
Re: [jr3] Node types
Hi, Which makes observation listeners an integral part of the microkernel, btw. The microkernel would only need to support one callback object (listener is probably the wrong word, because it is also called for read operations). This one would then call (and allow to register) regular JCR observation listeners. It would also deal with / delegate security, constraint checking like node type, and so on. I'm not sure who should be able to write to the node type system. It would be great if any session (with sufficient access rights) can, because that would simplify clustering. The 'node type system' would then just listen for changes on those nodes (and possibly revert those changes if they don't make sense - rolling back that transaction). Regards, Thomas
Re: [jr3] Flat hierarchy
About micro-kernel: I think the micro-kernel shouldn't have to support splitting large child lists into smaller lists. That could be done on a higher level. What it should support is all the features required by this mechanism: - Support 'hidden' nodes (those are marked using a hidden property). That means the path doesn't always map directly to the stored structure. Therefore the micro-kernel should not be directly responsible for building or interpreting JCR paths (micro-kernel paths are similar but they don't always match the JCR path). - The entry in the child node list may contain multiple properties (in most cases the name and the child node id; but sometimes also the reference of the next child node). The number of properties for each entry is the same however. For sorting, only the first element is relevant. - The child node list can always be stored in sorted order. But this sorting doesn't always map to the JCR child node list. Regards, Thomas
Re: [jr3] Use JCache JSR-107 for (all) caches
Hi, About clustering: there are two main use cases: A) to improve read throughput and to achieve high availability. In this case writes can be serialized. B) to improve write throughput. In this case writes should not be serialized; instead, writes should be merged later on (eventually consistent). I guess at some point we need to support both, but personally I think A is as important as (if not more important than) B. Regards, Thomas
Re: [jr3] Node types
Hi, This would be after the fact and wouldn't work to validate that changes are correct (to verify added / changed nodes don't violate node type constraints). Also it wouldn't work for security. Regards, Thomas
Re: [jr3] Node types
Hi, I don't see the point of doing such steps after the transaction has already been committed. Well, because you don't have a callback mechanism that gets called _before_ committing (or reading, in the case of security). I'd make node type constraints and security checks the responsibility of the client who commits the transaction. That's a solution :-) I'm not sure it's the _right_ solution, but we can start like that. Regards, Thomas
Re: [jr3] Plugin architecture
Hi, The configuration should be persisted in the repository itself, not in external configuration files. * dynamic configuration First of all, I would define an API for configuration changes. This API could be the regular JCR API, and the configuration could be stored in special system nodes. On top of that API, those who want to use OSGi can do that. Observation listeners (called triggers in relational databases) are currently not part of the configuration; you always have to add them after starting the repository. I think there should be a way to add a persistent observation listener that is automatically started whenever the repository is started. Repository and Session lifecycle listeners or transaction boundary checkers Same as for triggers. Regards, Thomas
Re: [jr3] Use JCache JSR-107 for (all) caches
Hi, Is Jackrabbit too slow for you? Or do you have out-of-memory problems? Or why do you want to use your own cache? features like overflow to disk I would try to avoid that. It's not really a 'cache' if it has to be stored to disk when the original data is also on disk. I would try to solve the root cause of the problem (problems supporting large transactions, improving performance) instead of trying to work around the issues on some higher level. Regards, Thomas
[jr3] Node types
Currently node types are an integral part of the repository. There is a special storage mechanism (the file custom_nodetypes.xml), which is non-transactional and problematic for clustering. To simplify the architecture ('microkernel'), could the node type functionality be implemented outside the kernel, as some kind of observation listener? The node type configuration could be stored as regular nodes in a special tree. When registering or modifying a node type, existing nodes may have to be updated, of course. The node type information itself could be stored in the nodes themselves as a hidden property.
Re: [jr3] Restructure Lucene indexing make use of Lucene 2.9 features
Hi, Thanks for the explanation! index every unique jcr fieldname in a unique lucene field, and do not prefix values as currently is being done. This sounds very reasonable. Regards, Thomas
Re: [jr3] Search index in content
I'd use Lucene to manage it. There are several problems. One is transactions; another is updating the index synchronously. A further problem is the dependence on Lucene itself, which is a problem for persistence and clustering. I would very much like to avoid inventing our own search index. I would definitely not use a completely new mechanism. I would re-use the repository to store the index data. Regards, Thomas
Re: [jr3] Flat hierarchy
Hi, I would also use a b-tree structure. If a node has too many child nodes, two new invisible internal nodes are created, and the list of child nodes is split up. Those internal nodes wouldn't have any properties. For efficient path lookup, the child node list should be sorted by name. This is a bit tricky. Currently, when adding a node, it is added as the last child. I suggest changing that behavior, and adding the node at the right place by default (so that the sort order is preserved). Like this, a path lookup is very fast even if there are many child nodes (binary search / b-tree). Is that an acceptable change (usability and spec conformance)? If the user changes the child node order (manually re-orders the nodes), then the sort order is broken, and the path lookup has to scan through all nodes. While that's much slower, I think it's acceptable. One alternative is to use a linked list (each child node points to the next child node), which is very problematic for shareable nodes. So there would be a hidden flag 'child nodes are sorted by name'. Regards, Thomas
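The sorted child list idea above, as a minimal sketch (an illustration only, not Jackrabbit code): with the list kept sorted by name, finding a child during path resolution is a binary search instead of a linear scan, and the proposed default insert preserves the order.

```java
import java.util.Arrays;

// Sketch: child lookup and order-preserving insert on a sorted child list.
final class ChildNodeList {
    // returns the index of 'name' in the sorted child list, or -1 if absent
    static int indexOf(String[] sortedChildNames, String name) {
        int i = Arrays.binarySearch(sortedChildNames, name);
        return i >= 0 ? i : -1;
    }

    // insert keeping sort order (the proposed default instead of append-at-end)
    static String[] insertSorted(String[] sortedChildNames, String name) {
        int i = Arrays.binarySearch(sortedChildNames, name);
        int pos = i >= 0 ? i : -i - 1; // binarySearch encodes the insertion point
        String[] result = new String[sortedChildNames.length + 1];
        System.arraycopy(sortedChildNames, 0, result, 0, pos);
        result[pos] = name;
        System.arraycopy(sortedChildNames, pos, result, pos + 1,
                sortedChildNames.length - pos);
        return result;
    }
}
```

Once the hidden 'sorted' flag is cleared by an orderBefore call, lookups would fall back to a linear scan over the same list.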
Re: [jr3] Flat hierarchy
Hi, I would also use a b-tree structure. If a node has too many child nodes, two new invisible internal nodes are created, and the list of child nodes is split up. Those internal nodes wouldn't have any properties. You mean a b-tree for each node? I think this could be a separate index, but one for the whole tree. The repository is one large b-tree, and each JCR node is a b-tree node (except for the JCR nodes that don't have any child nodes: those are b-tree leaves). If a JCR node has many child nodes, then there is at least one more level of b-tree nodes between the node and its child nodes. I think supporting fast path lookups for orderable child nodes is a bit more important than flat hierarchies Path lookups would still be fast (the same speed as now), except for large child node lists that were re-ordered. The difference only exists for large child node lists. There is a difference between 'orderable' nodes (which have the ability to reorder the child node list) and actually 're-ordered' child node lists. Is it acceptable if new nodes appear in lexicographic order in the child node list? Regards, Thomas
Re: [jr3] Search index in content
Hi, Property/value indexes: We will have to implement some kind of database persistence anyway. Databases support transactional indexes. We could use those instead of using Lucene. Or we could store the index in JCR nodes (which are part of the large repository b-tree). Indexes in databases are stored in exactly the same way. In any case, keeping the index and the persistence in the same storage simplifies transactional persistence a lot. A microkernel that relies on Apache Lucene even for simple property/value indexes is not an option in my view. Regards, Thomas
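A property/value index kept in the same transactional storage as the nodes could be modeled like this (a toy sketch, not a proposed API; a sorted map per property name stands in for the index nodes):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch: per-property-name index mapping value -> ids of nodes with that value,
// updated synchronously as part of the same commit that sets the property.
final class PropertyValueIndex {
    private final Map<String, TreeMap<String, Set<Long>>> byProperty = new HashMap<>();

    // called within the transaction that sets the property
    void onPropertySet(long nodeId, String property, String value) {
        byProperty.computeIfAbsent(property, p -> new TreeMap<>())
                  .computeIfAbsent(value, v -> new TreeSet<>())
                  .add(nodeId);
    }

    Set<Long> lookup(String property, String value) {
        TreeMap<String, Set<Long>> idx = byProperty.get(property);
        Set<Long> ids = idx == null ? null : idx.get(value);
        return ids == null ? Collections.emptySet() : ids;
    }
}
```

Because the index lives in the same storage as the data, a rollback of the transaction automatically rolls back the index entries too.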
Re: [jr3] Flat hierarchy
Hi, A Jackrabbit repository is some kind of b-tree - just that the pages are never split or balanced automatically. Maybe using 'b-tree' is confusing? Let's call it a manual b-tree then. i agree that flat hierarchies is an important feature, however we shouldn't compromise the performance for the 'standard' case with less than 1k child nodes I agree. Using the b-tree style wouldn't slow down the standard case. In the standard case things would stay exactly like they are now. add a next pointer This makes the data structure more complex but allows us to maintain support for orderable nodes. That's definitely an option. I just wonder if it's really required. I guess we will find out. Regards, Thomas
Re: [jr3] Plugin architecture
not sure that the JCR EventListener interface could be used for persistent observation listeners You are right. It would probably be a different API (to be defined). This mechanism could be used for (just an idea): - JCR observation - security (filtering nodes and properties; allowing / disallowing certain operations) - indexing - (maybe) register a remote repository Regards, Thomas
Re: [jr3] Flat hierarchy
Hi, I think Jukka is correct that the correct use of B-trees is to use one for each list of child nodes, not as a way to model the entire hierarchy. If you are more comfortable with this view, that's OK. I probably should have said: the whole repository is a tree data structure. And there are modifications that can easily be applied to B-trees that deal with arbitrary (not based on a key) ordering of the nodes Sure. Jackrabbit needs a way to quickly navigate to a node by path. For that, you have to traverse the nodes, and for each node you have to find the correct child. To do that, it's better if the child node list is ordered by name. Otherwise you have to iterate over all child nodes until you find the right one, or you need a secondary index (Lucene?). And that's true whether it uses a b-tree internally or not. The part that's not clear to me is how this can be efficiently combined with an append-only storage format that's being discussed on the [jr3] Unified persistence thread. It wouldn't be good if every time a list of children is modified the persistence layer has to make a complete copy of the modified B-tree You only have to update the b-tree node that is modified. That may be a hidden node (an internal, hidden b-tree node) or a real node. Regards, Thomas
Re: [jr3] Flat hierarchy
Hi, JCR requires lookup of children by name and/or position (for orderable children), so the implementation needs to support all these cases efficiently. The trickiest one to handle is probably Node.getNodes(String namePattern) because it requires using both name and position together. While it's true that all of that needs to be supported, I doubt that we should try to optimize for all cases. Otherwise the normal case will be slower. Usually, there are not that many child nodes. In that case lookup is not a problem: both an array and a hash map can be used (in memory). If there are many child nodes, then we should try to optimize for the most important case. I think it doesn't make sense to optimize for the case where a long list (many thousands) of children is manually re-ordered (using orderBefore). Regards, Thomas
Re: [jr3] Flat hierarchy
Hi, Even without using orderBefore, the specification still requires a stable ordering of children for nodes that support orderable child nodes (see 23.1 of the JCR 2.0 spec). Thanks for the link! I see now my idea violates the specification, which says (23.3): When a child node is added to a node that has orderable child nodes it is added to the end of the list. My idea was to add the child node according to its name (until the order is changed using orderBefore). One possibility is to limit the use of the B-tree to nodes that do not support orderable children You are right. Unfortunately, orderable child nodes are the default. Another solution is to keep a linked list from child node entry to child node entry (only in this case). Let's see how complicated that is. By the way, same-name siblings would work fine (no linked list required). Regards, Thomas
Re: [jr3] Search index in content
+1 For simple searches a built-in index would help a lot, for example for node names and (some) property values. Each property name could have its own index. Advantages: - transactional index updates - reduced complexity - reduced number of open files - makes it possible to implement Jackrabbit in C I would not try to use that to index binaries (fulltext index), or to re-implement advanced features (ranking, phrase queries, stemming, ...). Regards, Thomas
Re: [jr3] Restructure Lucene indexing make use of Lucene 2.9 features
Hi each property indexed in its own Lucene field Could you explain in more detail? What is a 1:1 mapping? Do you mean each property type should have its own index, or each property name should have its own index? Wouldn't this increase the number of Lucene index files a lot? Regards, Thomas
Re: [jr3] Unified persistence
Hi, I would implement the storage layer ourselves. It could look like this: - FileDataStore: keep as is (maybe reduce the directory level by one). - Each node has a number (I would use a long), used for indexing. - MainStorage: the node data is kept in an append-only main persistence storage. When switching to a new file, the node lookup table (node index) is appended. An optimization step would separate rarely updated nodes (old generation) from frequently updated nodes. A node and its child nodes are grouped together. - Namespace index, name index, node type registry: start with a fixed (hardcoded) list, and store additional entries as system nodes. Regards, Thomas
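The MainStorage idea might be sketched like this (all names and the record layout are assumptions; an in-memory list stands in for the file): node records are appended, and when the file is switched, the node lookup table itself is appended as the last record so it can be located when the file is reopened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of an append-only storage file with a trailing node index.
final class AppendOnlyFile {
    private final List<byte[]> records = new ArrayList<>();   // stands in for a file
    private final Map<Long, Integer> nodeIndex = new TreeMap<>(); // node id -> record #

    void appendNode(long nodeId, byte[] data) {
        nodeIndex.put(nodeId, records.size()); // remember where the record landed
        records.add(data);
    }

    byte[] readNode(long nodeId) {
        Integer pos = nodeIndex.get(nodeId);
        return pos == null ? null : records.get(pos);
    }

    // on file switch: the index itself is appended as the last record
    int close() {
        records.add(nodeIndex.toString().getBytes());
        return records.size(); // total record count, including the index record
    }
}
```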
Re: [jr3] One workspace to rule them all
Hi, The most obvious trouble with this approach is that the node UUIDs would no longer be unique within such a super-workspace. I'm not sure how to best solve that problem, apart from switching to some alternative internal node identifiers. Any ideas? Use a number (variable size when stored to disk; a long in memory) as the unique node identifier. We need a way to identify all nodes for indexing (property/value index) anyway. This number would not necessarily be accessible from the public API, however. I would still keep the UUID for backward compatibility, but only for referenceable nodes. It would be stored as a (hidden) property. Regards, Thomas
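The 'variable size on disk, long in memory' identifier could be encoded, for example, as a little-endian base-128 varint (an assumption; this is one common encoding, not a decided format), so small node ids take one byte:

```java
import java.io.ByteArrayOutputStream;

// Sketch: base-128 varint encoding for node ids. Each byte carries 7 bits of
// the value; the high bit signals that more bytes follow.
final class VarLong {
    static byte[] write(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7fL) != 0) {
            out.write((int) ((value & 0x7f) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // final byte: continuation bit clear
        return out.toByteArray();
    }

    static long read(byte[] buf) {
        long value = 0;
        int shift = 0;
        for (byte b : buf) {
            value |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) break; // last byte reached
            shift += 7;
        }
        return value;
    }
}
```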
Re: [jr3] Search index in content
Hi, For me, there are two kinds of indexes: the property/value indexes, and the fulltext index. The property/value indexes are for property values, node names, paths, node references, and so on. Such indexes are relatively small and fast. In relational databases, those are the secondary indexes (non-primary-key indexes). Those index updates should be done synchronously, as part of the transaction (maybe even in the transient space). Currently, we use Apache Lucene for this, but I wouldn't. I would keep those indexes within the repository. The fulltext index is (potentially) slow, especially fulltext extraction. Therefore, fulltext indexing should be done asynchronously if it takes too long. Also, in a clustered environment, at least text extraction should be done in only one cluster node. I would still use Apache Tika and Apache Lucene for this. Regards, Thomas
Re: [jr3] MVCC
Hi, I would do MVCC in a similar way as it is done in relational databases such as PostgreSQL. See also www.postgresql.org/files/developer/transactions.pdf Concurrent writes and MVCC: usually MVCC means readers are never blocked by other readers or writers, and writers are not blocked by readers. However, writers can block other writers when trying to update the same node (row in databases). Concurrent writes to disk: I think this only makes sense if the hardware supports it. With a single disk it doesn't make sense: concurrent writes to two different positions or files are actually slower than serialized writes to one position in one file. Relational databases don't usually persist all (intermediate) versions, just the committed version. I don't think that copy-on-read is a good idea. If we use append-only storage, in theory all old versions are available, but indexing those is problematic. Regards, Thomas
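A minimal sketch of the PostgreSQL-style MVCC described above (illustrative only; real row/node headers are more involved): each committed write stores a new version tagged with a revision, and a reader opened at revision r sees the newest version with revision <= r, so readers never block writers.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: per-node version chain keyed by commit revision.
final class MvccNode {
    private final TreeMap<Long, String> versions = new TreeMap<>(); // revision -> value
    private static final AtomicLong revisionCounter = new AtomicLong();

    static long nextRevision() { return revisionCounter.incrementAndGet(); }

    // writers block only other writers of the same node (node-level write lock)
    synchronized void commit(long revision, String value) {
        versions.put(revision, value);
    }

    // a reader sees the newest version committed at or before its snapshot
    synchronized String read(long snapshotRevision) {
        Map.Entry<Long, String> e = versions.floorEntry(snapshotRevision);
        return e == null ? null : e.getValue();
    }
}
```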
Re: [jr3] Unified persistence
Hi, About 'append only' and 'immutable' storage. Here is an interesting link: http://eclipsesource.com/blogs/2009/12/13/persistent-trees-in-git-clojure-and-couchdb-data-structure-convergence/ Regards, Thomas
Re: Jackrabbit 3: extracting same name sibling support from the core
Hi, A very simple implementation of my idea: http://h2database.com/p.html#e5e5d0fa3aabc42932e6065a37b1f6a8 The method hasSameNameSibling() is called for each remove(). If it turns out to be a performance problem, we could add a hidden property in the first SNS node itself (it's only required there). Does anybody see any other obvious problems? Regards, Thomas
Re: Upgrade from 1.5.5 to 2.0.0 was unsuccessful in clustered environment: Cause: java.sql.SQLException: Lock wait timeout exceeded; Any ideas?
Hi, Could you point me in the right direction for a production-ready model 3 deployment model (where we can access the repository remotely)? There is some documentation available here: http://wiki.apache.org/jackrabbit/RemoteAccess Regards, Thomas
Jackrabbit 3: extracting same name sibling support from the core
Hi, About SNS (same name siblings): what about moving that part away from the core? Currently, the Jackrabbit architecture is (simplified): 1) API layer (JCR API, SPI API) 2) Jackrabbit core, which knows about SNS After moving the SNS support, it would be something like this: 1) API layer (JCR API, SPI API) 2) SNS support code, which knows about SNS and maps SNS node names to/from internal node names 3) Jackrabbit core, which doesn't know anything about SNS (node names must be unique; [ and ] are supported within node names) My hope is that this would simplify the core, because it doesn't have to deal with SNS at all. Disadvantage: there is a risk that certain things would get a bit slower if SNS are actually used (especially if there are lots of SNS). Regards, Thomas
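The proposed mapping layer (step 2) might look roughly like this (a sketch only; the concrete internal naming scheme is an assumption, it simply embeds the SNS index in the internal name, which is legal because '[' and ']' stay allowed internally):

```java
// Sketch: map JCR same-name-sibling path segments to unique internal names,
// so the core below this layer only ever sees unique child names.
final class SnsMapper {
    static String toInternal(String jcrSegment) {
        // "child" and "child[1]" both map to the same internal name "child";
        // "child[2]", "child[3]", ... are already unique as-is
        if (jcrSegment.endsWith("[1]")) {
            return jcrSegment.substring(0, jcrSegment.length() - 3);
        }
        return jcrSegment;
    }

    static String toJcr(String internalName, boolean hasSiblings) {
        // the first sibling is presented as "child[1]" only when siblings exist
        if (hasSiblings && !internalName.contains("[")) {
            return internalName + "[1]";
        }
        return internalName;
    }
}
```

A real implementation would also have to renumber internal names on remove, which is where the hasSameNameSibling() check mentioned in the follow-up thread comes in.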
Re: Jackrabbit 3: extracting same name sibling support from the core
Hi, Could this be an optional feature in 3.x? As JCR 2.x is out, it could raise compatibility problems, right? This change wouldn't affect the public API. SNS would still be supported as they are now. Maybe with a few changes, but all within the JCR 2.0 specification. About compatibility: existing repositories would need to be converted in some way, that's true. One way to convert is the repository copier tool: http://wiki.apache.org/jackrabbit/BackupAndMigration Regards, Thomas
Re: Upgrade from 1.5.5 to 2.0.0 was unsuccessful in clustered environment: Cause: java.sql.SQLException: Lock wait timeout exceeded; Any ideas?
Hi, Please use the 'user' list for questions. the lock timeouts are occurring only with non-jcr tables during routine actions in other areas of our site, even though they have nothing to do with Jackrabbit. It sounds like the problem is not related to Jackrabbit then. disabling jackrabbit solved the problem It could be due to lower database activity. What happens if you use two independent databases? Regards, Thomas
Re: change proposal DataStore
Hi, Currently there is only one data store per repository. If you need a data store per workspace, then you need one repository per workspace. - Assign a datastore per workspace (customer) so it's possible to measure (and limit) storage usage for a given customer This sounds more like an accounting problem than a technical problem. Could you add some accounting code to the application? For example, use an ObservationListener to calculate the disk space used by a user (workspace). Or use a wrapper around the input stream and measure / limit storage like this. - Dynamic allocation, so newer or more accessed nodes will be stored on faster disks and old nodes will be moved to slower SATA disks Such a 'caching data store' would be nice. It's a bit tricky to implement, I think. Currently, there is no such implementation, however patches are always welcome. Regards, Thomas
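The accounting suggestion could be sketched like this (an illustration only; the wiring to a JCR ObservationListener or input stream wrapper is omitted, and the quota behavior is an assumption): track per-workspace binary usage in the application and enforce a limit there.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: application-level storage accounting per workspace (customer),
// instead of one data store per workspace.
final class StorageAccounting {
    private final Map<String, Long> bytesPerWorkspace = new ConcurrentHashMap<>();
    private final long quotaBytes;

    StorageAccounting(long quotaBytes) { this.quotaBytes = quotaBytes; }

    // called for each binary added in the given workspace, e.g. from an
    // observation listener or a measuring stream wrapper
    void onBinaryAdded(String workspace, long size) {
        long used = bytesPerWorkspace.merge(workspace, size, Long::sum);
        if (used > quotaBytes) {
            throw new IllegalStateException("quota exceeded for " + workspace);
        }
    }

    long used(String workspace) {
        return bytesPerWorkspace.getOrDefault(workspace, 0L);
    }
}
```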
Re: change proposal DataStore
Hi, extend the datastore interface with workspace name, node name, property name ... I'm not sure the workspace / node name / node identifier / property name is always available. One advantage of this addition would be that it could speed up garbage collection: if a binary object knows the node identifier(s), garbage collection could check the large objects first (because you could keep links from the binary object to the place where it is / was used). I'm not completely against such a change, however for the given problem (accounting) it does sound like the wrong solution. Regards, Thomas
Re: [VOTE] Release Apache Jackrabbit 2.0.0
+1 Release this package as Apache Jackrabbit 2.0.0 - checksums OK - licences OK - notice.txt, readme.txt and release-notes.txt files OK - mvn clean install OK with Sun Java 1.5.0_22 / Mac OS X Regards, Thomas
Re: javax.naming.NamingException: The repository home D:\repository appears to be in use since the file named .lock is locked by another process.
Hi, About the repository lock see http://wiki.apache.org/jackrabbit/RepositoryLock P.S. Please use the user list for usage questions Regards, Thomas On Thu, Jan 21, 2010 at 2:04 PM, abhishek reddy abhishek.c1...@gmail.com wrote: hi, The first time, I am able to access the repository successfully. From the second time onwards it gives the following exception: javax.naming.NamingException: The repository home D:\repository appears to be in use since the file named .lock is locked by another process. I have created the Repository and kept it in the application scope, and every time I am accessing the repository in the following manner: Repository repository = (Repository) sc.getAttribute("repository"); session = repository.login(); //code session.close(); How do I overcome this problem? Do I need to remove this lock file manually every time? -- Abhishek
Re: Hudson build is still unstable: Jackrabbit-trunk » Jackrabbit Core #961
Hi, now org.apache.jackrabbit.core.util.CooperativeFileLockTest.testFileLock failed thomas, I think it was you who added this test recently, right? Yes... does Hudson run on Windows? It looks like a timing problem (the thread doesn't stop quickly enough). Regards, Thomas
Re: Sling's use of Jackrabbit
Hi, We can't change that API part in 2.x. I understand we should not _change_ (or remove) a public API within 2.x. That's actually the main reason why I wouldn't export the PersistenceManager API now, because it would force us to keep it like it is for the whole 2.x. But we can still export _additional_ packages within 2.x (for example, export the PersistenceManager API in 2.1 or 2.2 if really needed). Regards, Thomas
Re: Sling's use of Jackrabbit
Hi, The problem of Jackrabbit Core is, that apart from implementing the Jackrabbit API (which is imported in the bundle), it has its internal API (for example the PersistenceManager interface or others). This internal API is not properly separated (in terms of Java packages) from implementation classes, which should not leak into the exported package space. If this really is an issue, we should try to solve it. Is it really an issue? A solution might be to move the PersistenceManager and other interfaces to jackrabbit-api (would we need to change the package name?). Regards, Thomas
Re: Sling's use of Jackrabbit
Hi, It's worth moving some of the internal API to jackrabbit-api so that other bundles can provide different implementations. It could be well documented, and better for third parties to extend Jackrabbit. I would do that only if there is an actual need for it. Do you have another implementation? Persistence manager, data store, or journal? If yes, would it be enough to just move the persistence manager interface, or do we need to do something else (for example, does your implementation extend the abstract bundle persistence manager)? Regards, Thomas
Re: Sling's use of Jackrabbit
Hi, I don't have another implementation at the moment for any of them. OK, good to know. I can imagine it might be possible to add a key/value store as a bundle persistence store in the future. I would wait until it's a real problem. Trying to solve _potential_ problems in advance is usually the wrong path. Regards, Thomas
Re: Sling's use of Jackrabbit
Hi, I would not move the API to the Jackrabbit API. Just moving the interfaces into separate packages, e.g. below o.a.j.core.api, would suffice: export this space and leave the implementation private. +1 Moving the persistence manager interface and co. (basically everything that can be swapped by custom implementations in repository.xml) into a separate API package is a good idea. But it makes sense to separate this from the client-side API in jackrabbit-api and only have this as an exported package in jackrabbit-core. -1 As I already wrote, it doesn't make sense to do that now. We can still do that later on, when there is actually somebody who needs it. Regards, Thomas
Re: [VOTE] Release Apache Jackrabbit 2.0 beta3
+1 Release this package as Apache Jackrabbit 2.0-beta3 - checksums OK - licences OK - notice.txt, readme.txt and release-notes.txt files OK - mvn clean install OK with Sun Java 1.6.0_15 / Mac OS X Regards, Thomas On Mon, Nov 23, 2009 at 10:21 AM, Sébastien Launay sebastienlau...@gmail.com wrote: Hi, [X] +1 Release this package as Apache Jackrabbit 2.0-beta3 - checksums [OK] - signature [OK] - license, notice, header and readme files [OK] - maven build [OK with one failed test] with Ubuntu Jaunty / Sun Java 1.6.0_14-b08 A test case in jackrabbit-jcr-client prevent the build from being successful but i think this test case is not critical for releasing a beta version: --- Test set: org.apache.jackrabbit.client.RepositoryFactoryImplTest --- Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 8.253 sec FAILURE! testGetSpi2davexRepository(org.apache.jackrabbit.client.RepositoryFactoryImplTest) Time elapsed: 0.078 sec ERROR! java.lang.UnsupportedOperationException: Missing implementation at org.apache.jackrabbit.spi2dav.ExceptionConverter.generate(ExceptionConverter.java:109) at org.apache.jackrabbit.spi2dav.ExceptionConverter.generate(ExceptionConverter.java:49) at org.apache.jackrabbit.spi2dav.RepositoryServiceImpl.getRepositoryDescriptors(RepositoryServiceImpl.java:537) at org.apache.jackrabbit.jcr2spi.RepositoryImpl.init(RepositoryImpl.java:82) at org.apache.jackrabbit.jcr2spi.RepositoryImpl.create(RepositoryImpl.java:95) at org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory.getRepository(Jcr2spiRepositoryFactory.java:166) at org.apache.jackrabbit.client.RepositoryFactoryImpl.getRepository(RepositoryFactoryImpl.java:75) at org.apache.jackrabbit.client.RepositoryFactoryImplTest.testGetSpi2davexRepository(RepositoryFactoryImplTest.java:169) Caused by: org.apache.jackrabbit.webdav.DavException: Method REPORT is not defined in RFC 2068 and is not supported by the Servlet API at 
org.apache.jackrabbit.webdav.client.methods.DavMethodBase.getResponseException(DavMethodBase.java:172) at org.apache.jackrabbit.webdav.client.methods.DavMethodBase.checkSuccess(DavMethodBase.java:181) at org.apache.jackrabbit.spi2dav.RepositoryServiceImpl.getRepositoryDescriptors(RepositoryServiceImpl.java:507) ... 31 more I do not test the maven artefacts just the source package and the war. -- Sébastien Launay
Re: How to reclaim disk space?
Hi, How big is this directory? By default, Jackrabbit uses Apache Derby to persist data. This directory belongs to the embedded Apache Derby databases. There is a way to compact Derby databases, however you would need to implement this yourself. I found the link to the Apache Derby documentation: http://db.apache.org/derby/docs/10.5/ref/ref-single.html#rrefaltertablecompress Regards, Thomas On Wed, Nov 11, 2009 at 4:21 PM, Xudaquan xudaquan2...@yahoo.cn wrote: Hello, I have the problem described below when I use Jackrabbit: when I add nodes to the repository, the size of some files in the directory 'jackrabbit\workspaces\default\db\seg0' increases incessantly. But after I stop adding nodes and remove all the nodes I added, the sizes of the files in the directory 'jackrabbit\workspaces\default\db\seg0' remain the same, so the disk space Jackrabbit uses can't be released. How can I solve this problem? Thanks!