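I see your point, but the problem is the tokenization: you have (in
your head) already parsed the query and picked out OR as an operator,
but the parser cannot do that, because at that stage it has not yet
decided where the tokens begin and end.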
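
To make the ambiguity concrete, here is a minimal Python sketch. It uses
the slash pattern quoted further down in this thread; the helper names and
the second query (title:/muon OR pion/) are made up for illustration and
are not Invenio code. Whichever step runs first, one of the two queries
comes out wrong.

```python
import re

# Pattern quoted later in this thread: anything between two slashes is a regex.
re_pattern_regexp_quotes = re.compile(r"/(.*?)/")


def regex_first(query):
    """Look for /.../ spans before doing anything else."""
    return re_pattern_regexp_quotes.findall(query)


def split_on_or_first(query):
    """Rough version of the precedence idea: split on OR before anything else."""
    return [part.strip() for part in re.split(r"\s+OR\s+", query)]


identifiers = "037:hep-ph/0105155 OR 037:astro-ph/0104076"
real_regex = "title:/muon OR pion/"  # hypothetical query with OR inside a regex

print(regex_first(identifiers))        # ['0105155 OR 037:astro-ph']  (wrong)
print(split_on_or_first(identifiers))  # ['037:hep-ph/0105155', '037:astro-ph/0104076']
print(regex_first(real_regex))         # ['muon OR pion']
print(split_on_or_first(real_regex))   # ['title:/muon', 'pion/']  (wrong)
```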

I would like to invite you to have a look at this post

http://29min.wordpress.com/2012/03/01/how-to-build-and-test-invenio-query-parser-iii/

There is a new component inside MontySolr that could do the job in
a different way. What you may not know is that this parser is written
using ANTLR, and ANTLR can generate Python code. So it is not only for
Java, but quite possibly usable for Invenio as well (though I have not
tested the Python generator yet).
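
To illustrate the kind of decision a grammar-based lexer can make, here is
a rough hand-written sketch in Python. It is not the ANTLR grammar from
MontySolr and not Invenio's parser; the token rules (a regex literal may
only start at the beginning of a term, right after an optional index
prefix) and all names are my own assumptions.

```python
import re

# Assumed token rules, checked in order at the current position:
# whitespace, parentheses, boolean operators, then a term.
TOKEN_RE = re.compile(r"""
    (?P<space>\s+)
  | (?P<paren>[()])
  | (?P<op>(?:AND|OR|NOT)(?=[\s()]|$)|\|)
  | (?:(?P<field>\w+):)?                 # optional index prefix such as 037:
    (?:
        /(?P<regex>(?:\\.|[^/\\])*)/     # regex literal: '/' opens the term
      | (?P<word>[^\s()]+)               # plain term: '/' is an ordinary character
    )
""", re.VERBOSE)


def tokenize(query):
    """Split a query into (kind, ...) tuples using the assumed rules above."""
    tokens = []
    pos = 0
    while pos < len(query):
        m = TOKEN_RE.match(query, pos)
        if m is None:
            raise ValueError("cannot tokenize at position %d" % pos)
        pos = m.end()
        if m.group("space"):
            continue
        if m.group("paren"):
            tokens.append(("PAREN", m.group("paren")))
        elif m.group("op"):
            tokens.append(("OP", m.group("op")))
        elif m.group("regex") is not None:
            tokens.append(("REGEX", m.group("field"), m.group("regex")))
        else:
            tokens.append(("TERM", m.group("field"), m.group("word")))
    return tokens


print(tokenize("037:hep-ph/0105155 OR 037:astro-ph/0104076"))
print(tokenize("title:/muon OR pion/"))
```

With these rules the first query yields two TERM tokens joined by an OR
operator, and the second yields a single REGEX token. The price is that a
regex can no longer start in the middle of a term.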

roman



On Thu, Mar 1, 2012 at 5:49 PM, Jan Åge Lavik <[email protected]> wrote:
>
> Just to chime in here: what about introducing precedence in this case?
> We would always consider the index (or MARC field) references first, thus
> splitting up the query before applying the regular expression that looks
> for regex patterns.
>
> Of course, we are still stuck when:
> hep-ph/0105155 OR astro-ph/0104076
>
> But precedence is also our friend here: the space or "OR" would split
> the query before we do anything more. However, this could be
> ambiguous with respect to an OR or a space inside the regular expression.
> We could then consider both readings and choose the one that gives the
> most specific results, and/or inform the user about the ambiguity.
>
> Cheers,
> Jan
>
> On 03/01/2012 04:19 PM, Roman Chyla wrote:
>> Hi Sam, all,
>>
>> Apologies if already known, but Sam today stumbled upon the
>> following query and got zero results
>>
>> 037:hep-ph/0105155 OR 037:astro-ph/0104076
>>
>>
>> This is because Invenio uses slashes as delimiters for a regex
>> query. Unfortunately this usage is ambiguous, because the '/' character
>> can also be part of a normal token.
>>
>> So the query is wrongly tokenized:
>>
>> ['+', 'hep-ph/0105155 OR 037:astro-ph/0104076', '037', 'a']
>>
>> Using quotes doesn't help
>>
>> 037:"hep-ph/0105155" OR "037:astro-ph/0104076"
>>
>> because the pattern matches anything between two slashes:
>>
>> re_pattern_regexp_quotes = re.compile("\/(.*?)\/")
>>
>> It would be possible to use negative lookbehind and escaping, but
>> that requires two operations (change regex, replace escape)
>>
>> In [78]: re.compile('(?<!\\\\)\/(.*?)(?<!\\\\)\/').findall('037:hep/123 OR 037:exp/567')
>> Out[78]: ['123 OR 037:exp']
>>
>> In [79]: re.compile('(?<!\\\\)\/(.*?)(?<!\\\\)\/').findall('037:hep\/123 OR 037:exp\/567')
>> Out[79]: []
>>
>> I guess it is a harder problem.
>>
>> Roman
>>
>>
>> On Thu, Mar 1, 2012 at 1:51 PM, Carli Samuele
>> <[email protected]> wrote:
>>> '(037:hep-th/0112017) | (037:hep-th/0112020)'
>>>
>>> --
>>> Samuele Carli
>
>
> - --
> - --
> Jan Åge Lavik
>
> CERN System Librarian
> GS-SIS
>
> Office: 3-1-014
> Mailbox: C27800
