On Sat, Dec 18, 2010 at 12:33 PM, Samuele Kaplun <[email protected]> wrote:
> Hi Roman,
>
> Il giorno sab, 18/12/2010 alle 12.17 +0100, Roman Chyla ha scritto:
>> I agree this is cool, but something doesn't fit, at least I don't
>> understand how this could be used for the task of bibclassify, the
>> dict is good if you know (more or less) what you are looking for, but
>> the task of bibclassify is to find entities inside the fulltext - and
>> to find that out, bibclassify has to search for it - and it is not
>> exactly the same thing as the spell checking. I must be missing
>> something, could you explain to me what advantage at all there would
>> be in using the dict? As a fast cache of single level entries? I could
>> see how it would be more useful for the cache, citation links etc.,
>> but not for bibclassify.
>
> I am not that aware of how BibClassify works right now, but if its final
> goal is to look for the most frequent keywords (from a controlled set)
> inside a fulltext, then, postponing the issue of grammar (plurals,
> genders, conjugations :-S), I think that it would indeed be possible to
> use dictd in a way orthogonal to how we currently use ontologies.
>
> Currently for each word in the ontology (correct me if I am wrong) we
> look how many times it appears in the text.
>
> On the other hand with dict, we might simply take all the words in the
> text, and filter them against the dictionary (which is built after the
> ontology), and then sum up the occurrences of repeated words.
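A minimal sketch of that filtering approach, assuming the ontology terms are already loaded into a Python set (the dictd lookup itself, and all grammar handling, are left out):

```python
import re
from collections import Counter

def keyword_counts(fulltext, ontology_terms):
    """Count how often each ontology term occurs in the text.

    ontology_terms is assumed to be a set of lowercase single words;
    plural/gender/conjugation variants are not handled here.
    """
    words = re.findall(r"[a-z]+", fulltext.lower())
    return Counter(w for w in words if w in ontology_terms)

terms = {"search", "engine", "invenio"}
print(keyword_counts("Invenio comes with its own search engine", terms))
```

This is the "take all words, filter against the dictionary" direction; replacing the set membership test with a dictd query is where the per-lookup cost discussed below comes in.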

OK, I see what you mean - it could work, but would it actually be an improvement?

If you take an average of 3000 words times the real time reported for
lookup above:

3000 * 0.004 = 12s
or
3000 * 0.006 = 18s

that is two to three times slower than the current bibclassify
implementation (in the case of HEP).

It could be faster for bigger dictionaries, like Eurovoc, where
bibclassify itself slows down -- or if we manage to cut down the
lookup time (e.g. by running dictd as a local process?)

>
> The two methods should accomplish the same goal (if I am not wrong on
> BibClassify algorithm) but the latter should be in principle extremely
> fast, unless the grammar issue is the bottleneck.

in principle, direct lookups must be replaced by some approximate
lookups (btw, I think dictd could handle grammar variations better
than the current regex patterns, so that would be a gain) - but in
many cases it will return several entries, and then it is necessary
to choose the right one. That might be easy for limited domains - for
Eurovoc, you will need some sort of disambiguation.
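To illustrate the "approximate lookup returns several candidates" problem, here is a toy sketch; the stem index and the crude plural stripping are invented for the example (a real setup would use dictd's match strategies instead):

```python
def approximate_lookup(word, index):
    """Return all dictionary entries sharing a crude stem with `word`.

    `index` maps stems to lists of canonical keywords. The stemmer
    here is a toy (strip a trailing 's'); it stands in for whatever
    fuzzy matching the dictionary server would provide.
    """
    stem = word.lower().rstrip("s")
    return index.get(stem, [])

index = {"model": ["model", "sigma model", "standard model"]}
candidates = approximate_lookup("models", index)
# several candidates come back -- picking the right one needs
# context-based disambiguation, which is the hard part
```

For a small HEP-style vocabulary the candidate lists stay short; for Eurovoc they will not, which is why some disambiguation step becomes unavoidable.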

Another interesting problem is the single keyword made of several
tokens, like 'search engine' in the sentence:

Invenio comes with its own search engine implementation.

will you ask for:
1. invenio
2. comes
....
6. search
7. engine
8. implementation

 -- somehow combine 6+7 based on the responses?

or create collocations and ask for them (that will double the number
of lookups, and does not handle inserted words)
Invenio comes
comes with
...
search engine
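The collocation variant above can be sketched in a few lines (a simple sliding window over whitespace-split tokens; real tokenization would be messier):

```python
def collocations(text, n=2):
    """Generate overlapping n-grams ('collocations') from a text.

    For n=2 this roughly doubles the number of lookups compared to
    single-word queries, and it cannot match a term with an inserted
    word in between ('search ... engine').
    """
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(collocations("Invenio comes with its own search engine"))
# ['Invenio comes', 'comes with', 'with its', 'its own',
#  'own search', 'search engine']
```

Combining single-word responses (the "6+7" idea) avoids the extra lookups but pushes the multi-token problem into the response-merging logic instead.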


Don't get me wrong, dictd is cool. I am just saying it is a tiny bit
more complicated.

Cheers,

  roman

>
> Cheers!
> Sam
>
> --
> Samuele Kaplun
> Invenio Developer ** <http://invenio-software.org/>
>
>
