Re: Antw: Re: duplicates

Marko Niinimaki Mon, 29 Mar 2010 10:27:16 +0200

Hello,

ok! Improving bibmatch has been planned for quite a long time, and
it's good to know we can get feedback from you, Guido.


I'll start by improving the documentation source behind
http://invenio-demo.cern.ch/help/admin/bibmatch-admin-guide

Then comes "fuzziness", that is, trying to match the input against a
collection by being "close enough". This would mean that the system
suggests a near match if there are typos, extra punctuation, etc.
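
A rough sketch of what "close enough" could look like, assuming titles
are compared as plain strings. The helper names and the 0.9 threshold
are made up for illustration and are not part of bibmatch; the
comparison itself uses Python's standard difflib module:

```python
import difflib
import string


def normalize(title):
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(title.lower().translate(table).split())


def suggest_near_matches(query, candidates, threshold=0.9):
    """Return (score, candidate) pairs at or above threshold, best first.

    Hypothetical helper: the threshold value is an assumption and
    would need tuning against real collections.
    """
    normalized_query = normalize(query)
    scored = []
    for candidate in candidates:
        matcher = difflib.SequenceMatcher(
            None, normalized_query, normalize(candidate))
        score = matcher.ratio()
        if score >= threshold:
            scored.append((score, candidate))
    return sorted(scored, reverse=True)
```

With this, an input title "Quantum feild theory" would still suggest
"Quantum field theory." as a near match, while unrelated titles stay
below the threshold.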

Yours,
Marko

On Mon, Mar 29, 2010 at 12:26 AM, Tibor Simko <[email protected]> wrote:
> Hi Guido:
>
> (CC-ing project-cdsware-developers)
>
> On Fri, 26 Mar 2010, Guido Pelzer wrote:
>>> On Fri, 26 Mar 2010, Guido Pelzer wrote:
>>>> many thanks for your mail. bibmatch works well. do you have a
>>>> detailed description of the bibmatch options?
>>>
>>> Currently not much docs besides the guide:
>>>
>>>    <http://invenio-demo.cern.ch/help/admin/bibmatch-admin-guide>
>>>
>>> But it may soon be updated, since Marko will work on small fuzzy-like
>>> features:
>>>
>>>    <https://savannah.cern.ch/task/?3273>
>>>
>>
>> yes, i had already seen it, but i have problems with the advanced
>> options, especially
>> -m --mode=(a|e|o|p|r)[3]
>> -o --operator=(a|o)[2] -> and/or???
>> and the difference between --print-new and --print-match
>
> The mode and operators are taken from search engine API.
>
> The output stream NEW prints unmatched records, MATCH prints matched
> records when there was exactly one dupe-like hit, and AMBIGUOUS prints
> matched records when there was more than one dupe-like hit.
>
> Personally I would prefer bibmatch to produce more than one output
> stream in one go, for example in the case of two output streams:
>
>   $ bibmatch foo.xml > foo_unmatched.xml 2> foo_matched.xml
>
> so that one has to process only its output files without diffing WRT the
> input file.
>
> Since Marko is attacking this module WRT some fuzziness, we can as well
> take this opportunity and change/prettify its API...
>
> Best regards
> --
> Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
>
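
P.S. The three output streams Tibor describes above boil down to a rule
on the number of dupe-like hits per record. A minimal sketch, assuming
the hit count is already known for each input record (the function name
and data layout are hypothetical, not bibmatch's actual code):

```python
def split_into_streams(records_with_hits):
    """Sort records into NEW, MATCH and AMBIGUOUS by their hit count.

    records_with_hits is a list of (record, number_of_hits) pairs;
    this layout is an assumption made for the example.
    """
    streams = {"NEW": [], "MATCH": [], "AMBIGUOUS": []}
    for record, hits in records_with_hits:
        if hits == 0:
            streams["NEW"].append(record)        # no dupe-like hit
        elif hits == 1:
            streams["MATCH"].append(record)      # exactly one hit
        else:
            streams["AMBIGUOUS"].append(record)  # more than one hit
    return streams
```

Producing all three lists in one pass is also what would let bibmatch
write several output streams in a single run.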
