Andreas Hartmann wrote:
I volunteered to take the lead in the implementation of the "tag cloud"
feature (see [1]).
I added a first version to the contributions area. The tag cloud is
visible in the defaultfiredocs publication.
Some initial ideas:
IMO it makes sense to use the Dublin Core element "subject" to assign
tags to a document [2].
I hard-coded this element for the moment.
Definition: "The topic of the resource."
Comment: "Typically, the subject will be represented using keywords, key
phrases, or classification codes. Recommended best practice is to use a
controlled vocabulary. To describe the spatial or temporal topic of the
resource, use the Coverage element."
I guess this can be made configurable; we could just use the DC subject
as the default. Since tags can contain spaces, we should use multiple
meta data values to store multiple tags. A nice GUI for this has to be
implemented. Would it be sufficient to extend the standard meta data GUI
to allow entering multiple values, or do we need a dedicated tag
management GUI? I'd suggest starting with the existing meta data GUI.
I haven't taken care of the multi-value handling yet. For now, the tags
are simply the terms that Lucene indexes. TODO: Define the meta element
values as keyword index fields instead of text fields to support phrases
(multi-word terms).
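
To illustrate why the field type matters for multi-word tags, here is a
minimal sketch in plain Python (not Lucene code). The two helper
functions are hypothetical stand-ins for an analyzed text field and an
untokenized keyword field; the real analyzer may behave differently.

```python
import re


def index_as_text(values):
    """Split each value into lowercase words, roughly like an analyzed text field."""
    terms = []
    for value in values:
        terms.extend(re.findall(r"\w+", value.lower()))
    return terms


def index_as_keyword(values):
    """Keep each value as a single term, roughly like an untokenized keyword field."""
    return [value.strip() for value in values if value.strip()]


tags = ["open source", "Lucene"]
text_terms = index_as_text(tags)        # the phrase is split into two terms
keyword_terms = index_as_keyword(tags)  # the phrase survives as one term
```

With text indexing, "open source" ends up as two separate entries in the
cloud; with keyword indexing it stays one tag.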
Finding all documents with a certain tag is rather simple since all meta
data are indexed.
I used the standard search for this purpose. I had to extend the Lucene
module sitemap with a "raw" query type. The query looks like this:
\{http\://purl.org/dc/elements/1.1/\}subject:foobar
It's rather ugly that this appears in the search box. Maybe we have to
add a dedicated meta data search or use another concept for special
search terms.
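
As a rough sketch of how such a raw query string could be assembled: the
function names and the set of escaped characters below are assumptions
for illustration, not the code in the Lucene module. The character set
should be checked against the query syntax of the Lucene version in use,
and tags containing spaces are not handled here.

```python
# Characters assumed to be special in the query syntax (an assumption;
# verify against the query parser documentation of the Lucene version used).
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_query_text(text):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def build_tag_query(namespace, element, tag):
    """Build 'field:term' where the field is the {namespace}element name."""
    field = "{" + namespace + "}" + element
    return escape_query_text(field) + ":" + escape_query_text(tag)


query = build_tag_query("http://purl.org/dc/elements/1.1/", "subject", "foobar")
print(query)
```

For the DC subject and the tag "foobar" this yields the same string as
the query shown above.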
The real challenge is to generate a list of all
existing tags.
Maybe there is a performant way to generate the cloud using the index,
e.g. via a wildcard query. But this still needs some postprocessing, so
we'll probably have to cache the tag cloud.
Lucene allows enumerating all terms for a particular field. To filter
by language, I had to add a loop which searches the index for each
term. I guess this takes quite a lot of time. Maybe someone knows a
better solution? Or maybe a new version of Lucene has a more flexible
API for term enumeration?
If you omit the language parameter of the IndexTermsGenerator, the
language filtering is skipped and the listing of the terms is probably
pretty fast.
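
One possible alternative to searching once per term is to walk the
documents of the requested language once and count their tags. The
sketch below is plain Python over a hypothetical in-memory list, not the
IndexTermsGenerator code; whether the Lucene API allows an equivalent
single pass is an open question. The weight scaling for the cloud is
also just one simple choice.

```python
from collections import Counter

# Hypothetical stand-in for the index: one record per document.
DOCUMENTS = [
    {"language": "en", "subjects": ["lucene", "search"]},
    {"language": "en", "subjects": ["lucene"]},
    {"language": "de", "subjects": ["lucene", "suche"]},
]


def count_tags(documents, language=None):
    """Count documents per tag; skip the language filter if language is None."""
    counts = Counter()
    for doc in documents:
        if language is None or doc["language"] == language:
            counts.update(set(doc["subjects"]))  # count each tag once per document
    return counts


def scale_weights(counts, levels=5):
    """Map counts linearly to integer weights 1..levels (e.g. font size classes)."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    return {
        tag: 1 if span == 0 else 1 + round((n - low) * (levels - 1) / span)
        for tag, n in counts.items()
    }


english = count_tags(DOCUMENTS, "en")
cloud = scale_weights(english)
```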
If Lucene doesn't help, we have another nifty feature for this purpose:
the RepositoryListener interface. By registering a listener with the
repository, we can extract the tags of a document when it is saved, and
update the tag cloud accordingly. The cloud also has to be updated when
a document is removed. The details are a bit tricky (concurrency,
queuing), but I think there's nothing that can't be solved. In this case
we have to store the tag cloud. My first idea would be to use a
dedicated document for this purpose.
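
The listener idea could look roughly like the following sketch. It is
plain Python, not the RepositoryListener interface; the class and method
names are hypothetical, and a single lock is only the simplest way to
deal with concurrent saves (no queuing, no persistence).

```python
import threading
from collections import Counter


class TagCloudStore:
    """Keeps per-tag document counts up to date as documents change."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tags_by_doc = {}      # document id -> set of tags
        self._counts = Counter()    # tag -> number of documents

    def document_saved(self, doc_id, tags):
        new_tags = set(tags)
        with self._lock:
            old_tags = self._tags_by_doc.get(doc_id, set())
            self._counts.subtract(old_tags - new_tags)  # tags removed by this save
            self._counts.update(new_tags - old_tags)    # tags added by this save
            self._tags_by_doc[doc_id] = new_tags
            self._drop_unused()

    def document_removed(self, doc_id):
        with self._lock:
            old_tags = self._tags_by_doc.pop(doc_id, set())
            self._counts.subtract(old_tags)
            self._drop_unused()

    def _drop_unused(self):
        # Remove tags that no document uses any more.
        for tag in [t for t, n in self._counts.items() if n <= 0]:
            del self._counts[tag]

    def snapshot(self):
        with self._lock:
            return dict(self._counts)


store = TagCloudStore()
store.document_saved("a", ["lucene", "search"])
store.document_saved("b", ["lucene"])
store.document_saved("a", ["lucene"])   # "search" was removed from document a
store.document_removed("b")
```

The per-document tag map is exactly the redundant information such a
solution has to store somewhere.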
I'd prefer the dynamic generation using Lucene, though, because
otherwise we store redundant information in the repository which always
carries a certain risk.
I think we can use Lucene. No need for the repository listening.
Another issue is supporting the user when she enters the tags. The
system should present a list of existing tags, possibly with some kind
of autocomplete functionality. But I guess when we manage to generate
the cloud, this feature can easily be added.
I haven't tackled this issue yet.
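
Once the list of existing tags is available, the autocomplete part could
be as simple as a prefix lookup on the sorted list. A minimal sketch in
plain Python; the function name and the sample tags are made up for
illustration.

```python
from bisect import bisect_left


def suggest_tags(sorted_tags, prefix, limit=10):
    """Return up to `limit` tags from a sorted list that start with `prefix`."""
    start = bisect_left(sorted_tags, prefix)  # first tag >= prefix
    matches = []
    for tag in sorted_tags[start:]:
        if not tag.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(tag)
    return matches


existing = sorted(["lenya", "lucene", "metadata", "search", "lucene index"])
suggestions = suggest_tags(existing, "lu")
```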
Any comments and improvements are greatly appreciated!
-- Andreas
--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01