Re: [Dbpedia-gsoc] Contribute to DbPedia

Abhishek Gupta Wed, 11 Mar 2015 12:58:21 -0700

Hi Thiago,

I have addressed your concerns below:



>>> For handling the above issue I would like to state a couple of
>>> approaches:
>>>
>>> *Approach 1: This approach has high space complexity*
>>>
>>> From my perspective, when we create an SF storage it should satisfy the
>>> following properties:
>>> 1) It extracts at least the correct surface form in almost all cases,
>>> regardless of how many SFs we extract, by keeping the surface form
>>> selection loose.
>>> 2) It can identify even an SF that we have not seen in our data,
>>> e.g. HD -> Harley Davidson
>>>
>>> This might lead to quite a large number of candidate entities and hence
>>> difficult disambiguation, but that is the cost we might have to pay for
>>> detecting unseen instances. Disambiguation might still be able to handle
>>> this, because the contexts of the entities will be quite different in
>>> most cases.
>>>
>>> Prospective solutions to satisfy the above properties:
>>> 1) Keep SF selection, and hence the candidate selection algorithm, loose.
>>> 2) For this I would like to introduce another design that, besides
>>> stemming, uses the following pipeline:
>>>     a) Convert every entity using a function and find
>>> probableSurfaceForms_1 (pSF1) like below:
>>>          (i) Entity Retained: "Michael Jeffery Jordan" -> "Michael
>>> Jeffery Jordan"
>>>          (ii) Acronym: "Michael Jeffery Jordan" -> "MJJ"
>>>          (iii) Omission: "Michael Jeffery Jordan" -> "Michael Jordan"
>>>          (iv) Combination of (i), (ii), and (iii): "M.J. Jordan", "M.
>>> Jordan", "M." (like Mr. M.)
>>>     b) Convert pSF1 to pSF2:
>>>          Step1: Remove Determiners, Prepositions, Stop-words, Punctuation
>>>          Step2: Convert to lowercase
>>>     c) Perform stemming on pSF2 & convert pSF2 to stemmedSurfaceForm
>>> (sFF)
>>>     d) Store sFFs with indexes of corresponding entities. (We are not
>>> storing pSF1s or pSF2s)
>>>
>>
>>
> One case which commonly occurs is a combination of upper and lower case,
> e.g. "Michael jordan", but yeah :) this is a good direction. We partially
> tried to implement something in that direction.
>
>
>> We tried to do something like this in the PR you mentioned, but in a far
>> less systematic way. So your suggestions are welcome.
>> One thing that you need to worry about, though, concerns step (b).
>> Spotlight is a bit language agnostic, so we would need to add information
>> about determiners, prepositions and so on for a series of languages. This
>> is not very complicated but worth keeping in mind.
>>
>

I included step (b) for the very reason David points out. The data may
contain instances like "M. jordan", "MJ", or "Mr. m.j. jordan"; they may be
rare, but we have to handle them.

Regarding Thiago's concern, one case can be a problem: two or more entities
are converted to the same sFF and also have similar contexts. This is the
worst case, and we might not be able to handle it. Otherwise, however many
candidate entities we get, our disambiguator should resolve them using the
context. This holds even when one entity in English and one in, say, French
end up with the same sFF as a result of the step 2 operations: we mark both
as candidate entities, and our disambiguator then chooses the correct one
based on context.
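
To make the collision case concrete, here is a minimal Python sketch of the
index from step 2. It is only an illustration under my own assumptions: the
stop-word list is a tiny English-only placeholder, toy_stem stands in for a
real stemmer, and all function names are hypothetical, not Spotlight code.

```python
import re
from collections import defaultdict

# Assumption: a tiny English-only list; a real one would be per language.
STOP_WORDS = {"the", "of", "a", "an", "mr", "mrs", "dr"}


def toy_stem(token):
    """Placeholder stemmer (assumption): strips a trailing plural 's'."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def variants(name):
    """Step (a): retained form, acronym, and middle-name omission."""
    parts = name.split()
    forms = {name, "".join(p[0] for p in parts)}
    if len(parts) > 2:
        forms.add(parts[0] + " " + parts[-1])
    return forms


def normalize(form):
    """Steps (b) and (c): strip punctuation, lowercase, drop stop words, stem."""
    tokens = re.sub(r"[^\w\s]", " ", form).lower().split()
    return " ".join(toy_stem(t) for t in tokens if t not in STOP_WORDS)


def build_index(entities):
    """Step (d): map each sFF to the set of entities that produce it."""
    index = defaultdict(set)
    for entity in entities:
        for form in variants(entity):
            key = normalize(form)
            if key:
                index[key].add(entity)
    return index


index = build_index(["Michael Jeffery Jordan", "Harley Davidson", "Hard Drive"])
```

In this sketch the key "hd" maps to both Harley Davidson and Hard Drive, so
both become candidates and the choice is left to the disambiguator.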



>
>>> Now let's approach from raw text - After spotting a sequence of tokens
>>> (SoT) in raw text, which might be an entity reference, we should pass this
>>> sequence of tokens through the same function that I mentioned in step 2
>>> (part (a) and (b)) and then match the output with stored sFFs. That gives
>>> us the entities concerned. We can then also calculate the relevance
>>> between the sequence of tokens and the sFFs of the corresponding entities
>>> using a function like Levenshtein distance.
>>>
>>> I am doing some additional steps so as to address the following rare
>>> situation, which we don't have in our data:
>>> "My *HD* is world's most amazing motorcycles I have ever seen."
>>> This situation is quite unlikely, as there might not be any reference from
>>> HD to Harley Davidson, but from the context we can infer that it might be
>>> Harley Davidson using the above approach.
>>>
>>>
>> I'm not sure I get this, but we would definitely need to review the way
>> we score the association between a surface form and a given candidate. In
>> your example you rely on the contextual score, but it's very important to
>> keep in mind that in order for the loose matching approach to work, we
>> would need to do some improvements on the context store as well. This is
>> why there's another gsoc Idea related to that.
>>
>
I wanted to explain how we will process raw text. Take the example below.
After our spotter finds a sequence of tokens ("Mr. Michael J. Jordan"), we
pass it through the operations in step 2. Then we check whether the result
(Michael J Jordan) is present in our sFF list.

"Mr. Michael J. Jordan is the greatest basketball player of all time."
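
A rough sketch of that lookup, again only under my own assumptions: the
TITLES set and the function names are hypothetical, stemming is left out for
brevity, and the stored sFFs are made-up examples. The spotted tokens are
normalized and then ranked against the stored sFFs by Levenshtein distance.

```python
import re

# Assumption: titles to drop during normalization (English only).
TITLES = {"mr", "mrs", "dr"}


def normalize(form):
    """Strip punctuation, lowercase, and drop titles (stemming omitted)."""
    tokens = re.sub(r"[^\w\s]", " ", form).lower().split()
    return " ".join(t for t in tokens if t not in TITLES)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def rank(spotted, stored_sffs):
    """Return (distance, sFF) pairs, closest stored sFF first."""
    query = normalize(spotted)
    return sorted((levenshtein(query, s), s) for s in stored_sffs)


stored = ["michael jeffery jordan", "michael jordan", "mjj", "hd"]
ranking = rank("Mr. Michael J. Jordan", stored)
```

With these made-up sFFs the closest match is "michael jordan" at distance 2,
so an exact hit is not required for the entity to become a candidate.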


>>
>>> *Approach 2: This approach might have high time complexity*
>>>
>>> Instead of finding candidate entities without using context we can use
>>> our context to some extent.
>>> 1) We locate our context in a connected entities graph using the context
>>> of sequence of tokens.
>>> 2) Find all entities linked to our context and they will be our
>>> candidate entities using Babelfy approach.
>>> 3) Pass all the candidate entities to the function mentioned in step 2
>>> of Approach 1.
>>> 4) Pass the SoT through the same function (part (a) and (b))
>>> 5) Score candidates using Levenshtein Distance
>>>
>>> Actually in Approach 2 we are doing a bit of disambiguation in Step 1
>>> itself which will reduce our count of sFFs.
>>>
>>> Please review these ideas and provide your feedback.
>>>
>>>
>> I'm not sure whether I understand this entirely, but I'm very interested
>> in other ways to conceptualise context. Spotlight just uses a simple
>> distributional method, but you can definitely use the link structure within
>> wikipedia to find candidates that are more related to themselves. In your
>> example above the pair Motorcycle - Harley Davidson would be much more
>> related than Motorcycle - Hard Drive for example. However, this would
>> require coding from scratch, so bear in mind that it might be too much
>> work.
>>
>>
>>> Moreover, I am trying to set up the server on my own PC, which is
>>> taking some time due to a 10Gb file. I will share results as soon as
>>> I have some. Till then I might follow up with some other warm-up
>>> task related to project ideas 5.15 and 5.16.
>>>
>>>
>
> The English model is a bit big. Consider using a smaller model for playing
> around, e.g. Danish, Turkish, Dutch or Spanish
>


Thank you for your advice and feedback.

Regards,
Abhishek
