Hi Alex and Côme,
I'm checking your cases (see below for the French case).
Here are some stats on your sample ("total recall"):
Annotated count of SF:
4
Total counts of SF:
55
Annotation probability
0.07272727272727272
---------------Candidates------------------------------------
http://dbpedia.org/resource/Eidetic_memory
http://dbpedia.org/resource/Recall_(memory)
The surface form is in the main Surface Form storage
This sample is not spottable because its annotation probability is too low
(4 annotated out of 55 occurrences, about 0.07), and yes, the movie is not
among the candidates.
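For reference, the annotation probability above is just the ratio of the two counts. Here is a minimal sketch in Python; the function name and the threshold value are placeholders I made up for illustration, not Spotlight's actual API or configured cutoff:

```python
def annotation_probability(annotated_count: int, total_count: int) -> float:
    """Share of a surface form's occurrences that were annotated as links."""
    if total_count == 0:
        return 0.0
    return annotated_count / total_count


# Counts reported above for "total recall".
p = annotation_probability(4, 55)

# HYPOTHETICAL threshold, for illustration only; the real cutoff depends on
# the Spotlight configuration and is not stated in this thread.
ASSUMED_MIN_PROBABILITY = 0.1

is_spottable = p >= ASSUMED_MIN_PROBABILITY
print(round(p, 4), is_spottable)
```

With any cutoff above roughly 0.073, a surface form with these counts would be filtered out before disambiguation.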
==================
So I think this problem is different from the issue you previously posted
on GitHub.
The FSA definitely seems to improve lowercase handling, simply because
the FSA is built on all the surface forms that are in the main store.
This means that the lowercase forms of everything in the surface form
store are in principle spottable. However, there are some filters.
There are two surface form storages: the main storage (where uppercase
forms are supposed to be stored) and the lowercase storage.
The main one is supposed to store uppercase surface forms; however, as far
as I understand, it also stores lowercase forms that were explicitly
annotated in the data used to generate the models. So in your case there was
a "total recall" annotation with the candidate topics you see in the output.
That's why it lives in the main storage.
The lowercase storage is meant to store artificial surface forms created
by lowercasing the entries in the main storage.
When the spotter is called, it first checks the main storage, and only if
nothing is found there does it check the lowercase storage.
So in your case, because "total recall" exists in the main storage, the
lookup never reaches the lowercase one.
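To make the lookup order concrete, here is a rough sketch in Python. It only mirrors the behaviour as I described it above; the store contents, the "Berlin" entry, and the function names are all made up for illustration and are not Spotlight's real classes or data:

```python
from typing import Dict, Optional, Set, Tuple


def build_lowercase_store(main_store: Set[str]) -> Dict[str, str]:
    """Artificial lowercase variants, each mapped back to its main-store form."""
    return {sf.lower(): sf for sf in main_store if sf.lower() != sf}


def lookup(sf: str, main_store: Set[str],
           lowercase_store: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Check the main store first; fall back to the lowercase store only on a miss."""
    if sf in main_store:
        return ("main", sf)
    if sf in lowercase_store:
        return ("lowercase", lowercase_store[sf])
    return None


# Toy data (assumption): "total recall" was explicitly annotated in lowercase,
# so it sits in the main store next to "Total Recall".
main_store = {"Total Recall", "total recall", "Berlin"}
lowercase_store = build_lowercase_store(main_store)

print(lookup("total recall", main_store, lowercase_store))
print(lookup("berlin", main_store, lowercase_store))
print(lookup("paris", main_store, lowercase_store))
```

In this toy setup "total recall" resolves in the main store with its own (low) statistics, while "berlin" is only reachable through the lowercase store, which maps it back to "Berlin".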
Another issue I know of (which can also affect the spotting of lowercase
forms) is the discount mechanism; there is an open issue about it. It affects
surface forms that are subparts of others. For example, "apple" is a SF, but
it is also a substring of another SF: "Apple macbook pro".
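I am not reproducing the discount logic itself here, but the set of surface forms it can touch is easy to illustrate. The sketch below (Python, hypothetical helper name, case-insensitive token matching as an assumption on my side) only finds the forms that are contained in a longer form:

```python
from typing import Iterable, Set


def subpart_surface_forms(surface_forms: Iterable[str]) -> Set[str]:
    """Forms whose tokens appear contiguously inside a longer form (case-insensitive)."""
    tokenized = {sf: sf.lower().split() for sf in surface_forms}
    affected = set()
    for sf, tokens in tokenized.items():
        n = len(tokens)
        for other_tokens in tokenized.values():
            if len(other_tokens) <= n:
                continue  # only strictly longer forms can contain this one
            windows = (other_tokens[i:i + n]
                       for i in range(len(other_tokens) - n + 1))
            if any(window == tokens for window in windows):
                affected.add(sf)
                break
    return affected


forms = ["apple", "Apple macbook pro", "google"]
print(subpart_surface_forms(forms))
```

Here only "apple" is reported, because it is the only form that also occurs inside a longer one.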
======
Checking the French case: "*Hétérozygote*"
There is a "*hétérozygote*" with an annotation probability of 0.67 in the
main surface form storage.
My guess here is that there is an issue with the FSA, since the FSA should
match *hétérozygote* as a candidate surface form and then the spotter
should find it in the main (uppercase) storage.
I would say that most likely it is an issue with the stemmer/FSA creation.
On Tue, May 20, 2014 at 8:44 AM, David Przybilla <[email protected]> wrote:
> Hi Alex,
>
> I'll check the case you mentioned ("total recall")
>
> In my experience the problems are no longer with the FSA but with other
> bits. For example, the discounting mechanism (i.e. "google" and "apple" are
> no longer spottable).
> On 19.05.2014 at 19:00, "Alex Olieman" <[email protected]> wrote:
>
>> I've been struggling with the same issue for over a week now, and think
>> there is cause to re-open the issue on github.
>> As I understand from the github discussion, there should only be a
>> problem with the OpenNLPSpotter. However, using the FSASpotter I see
>> exactly the same problem. Adding lowercase surface forms for resources also
>> doesn't seem to change much.
>>
>> Has something changed since the issue was closed on github?
>>
>> A nice test case is "total recall movie": it has a completely different
>> set of candidates than "Total Recall movie".
>>
>> Cheers,
>> Alex
>>
>>
>> On Thu, May 15, 2014 at 5:23 PM, Sang Venkatraman <
>> [email protected]> wrote:
>>
>>> Hi -- An issue was created for this (and has been closed). I have not
>>> had a chance to test this stuff recently but the comments should give you
>>> some idea.
>>>
>>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/196
>>>
>>> Thanks,
>>> Sang
>>>
>>>
>>> On Thu, May 15, 2014 at 9:19 AM, Côme SAUVAL
>>> <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm currently trying to install and run dbpedia-spotlight on my own
>>>> server.
>>>> I have followed the "build source from Maven" tutorial and it's working
>>>> almost fine, I only have one problem left.
>>>>
>>>> Here's an example of the problem:
>>>>
>>>> The text where I want to spot some words is: "Quel est le lien entre
>>>> toux, déficit hétérozygote en alpha1 anti trypsine chez un patient porteur
>>>> de RCUH".
>>>>
>>>> When I run it on the demo or on the http://spotlight.sztaki.hu:2225/
>>>> server,
>>>> I get 2 results: *hétérozygote* and *trypsine*
>>>>
>>>> But when I use my own server, I only get those results *if the words
>>>> are capitalized*: "Quel est le lien entre toux, déficit *Hétérozygote* en
>>>> alpha1 anti
>>>> *Trypsine* chez un patient porteur de RCUH"
>>>> In this case I get *Hétérozygote* and *Trypsine*, but if I leave it in
>>>> lower case I get no results.
>>>>
>>>> Has anyone had the same issue?
>>>>
>>>> How can I configure the server to resolve this issue?
>>>>
>>>> Thanks a lot for your help,
>>>>
>>>> Côme
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
>>>> Get unparalleled scalability from the best Selenium testing platform
>>>> available
>>>> Simple to use. Nothing to install. Get started now for free."
>>>> http://p.sf.net/sfu/SauceLabs
>>>> _______________________________________________
>>>> Dbp-spotlight-users mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users