Hi Indhu,

Thank you very much for the details.

Just to confirm: the age is estimated within a 10-year window, so the 65%
accuracy means the estimated window contains the actual age 65% of the
time. Is this correct?
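
To make sure I am reading the measure the same way, here is a minimal
sketch of what I have in mind. The function names and the sample ages are
made up for illustration and are not taken from the Age-Predictor code;
only the +/- 5 year window comes from your description.

```python
def age_window(predicted_age, half_width=5):
    """Return the (low, high) range reported around a predicted age."""
    return predicted_age - half_width, predicted_age + half_width


def window_accuracy(predicted_ages, actual_ages, half_width=5):
    """Fraction of documents whose actual age lies inside the window."""
    if len(predicted_ages) != len(actual_ages):
        raise ValueError("predicted and actual ages must have the same length")
    if not predicted_ages:
        raise ValueError("need at least one prediction")
    hits = 0
    for predicted, actual in zip(predicted_ages, actual_ages):
        low, high = age_window(predicted, half_width)
        # Count a hit when the true age falls inside the reported range.
        if low <= actual <= high:
            hits += 1
    return hits / len(predicted_ages)


# Hypothetical numbers, only to show the calculation.
predicted = [23, 40, 31, 55]
actual = [26, 52, 30, 49]
print(age_window(23))
print(window_accuracy(predicted, actual))
```

If this matches what you measured, then the 65% is the share of documents
where the true age falls inside the reported range.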

This makes sense for age, but other attributes can't be treated the same
way, since their values are categorical rather than continuous.

I recommend we use an approach similar to the GeoTopicParser where an
OpenNLP classifier is used.

Best,

Anthony

On Wed, Jun 15, 2016 at 12:47 PM, Indhu Kamala Kumar <[email protected]>
wrote:

>
> Hi,
>
> We use linear regression to predict the author's age. The model is built
> by selecting the top 'n' bigrams as features. Linear regression is then
> fitted on the generated features, where X is the feature matrix and y is
> the corresponding array of ages. The coefficients and intercept are
> calculated and the age is predicted. The age range is then formed by
> subtracting and adding 5 years to the predicted age. The model is about
> 65% accurate for large documents. Here is the link to the research paper
> we referred to:
> http://repository.cmu.edu/cgi/viewcontent.cgi?article=1215&context=lti
>
> Regards,
> Indhu
>
> On Tue, Jun 14, 2016 at 9:56 AM, Madhav Sharan <[email protected]> wrote:
>
>> Yeah, agreed. I saw your project and I liked the way you created binary
>> and quad age groups. *Indhu* can share more details on the linear
>> regression approach and its accuracy. As far as I know, it's a bigram
>> model based on the top 10k features.
>>
>> This is what the Tika CLI response looks like:
>>
>> Content-Length: 6954
>> Content-Type: application/xml
>> *Estimated-Author-Age: 23*
>> *Estimated-Author-Age-Range: 18-28*
>> X-Parsed-By: org.apache.tika.parser.CompositeParser
>> X-Parsed-By: org.apache.tika.parser.nlp.classifier.TextFeatureParser
>> resourceName: pom.xml
>>
>> I was thinking of adding more metadata fields from different approaches
>> to the same response. For example, we can add a new field,
>> *Estimated-Author-Age-Binary-Group*, to this. We can run multiple REST
>> API calls in parallel and enable or disable them through a property file.
>> Basically, let the user define which APIs to run, and we can combine all
>> the results through Tika.
>>
>> Thanks
>>
>> --
>> Madhav Sharan
>>
>>
>> On Tue, Jun 14, 2016 at 12:51 AM, Anthony Beylerian <
>> [email protected]> wrote:
>>
>>> Hi Madhav,
>>>
>>> Thank you for sharing, yes maybe it's possible.
>>>
>>> Although there is overlap, the two approaches are a bit different.
>>>
>>> Do you have some documentation on the performance of the linear
>>> regression approach?
>>>
>>> I'm not sure how well it would perform for gender (binary) and other
>>> attributes.
>>>
>>> Ideally it would be desirable to have a way to capture all traits with
>>> reasonable performance.
>>>
>>> Best,
>>>
>>> Anthony
>>>
>>>
>>> On Tue, Jun 14, 2016 at 8:46 AM, Madhav Sharan <[email protected]> wrote:
>>>
>>>> Hi Anthony, the age prediction part of this enhancement looks very
>>>> similar to
>>>> https://issues.apache.org/jira/browse/TIKA-1988
>>>>
>>>> Do you see any way we can collaborate on this feature? I was thinking
>>>> of building a TextFeatureParser which can parse multiple text-based
>>>> features like age.
>>>>
>>>> In our project, for age prediction we built a classifier using linear
>>>> regression, which is available through a REST API (more details in
>>>> [0]). We can configure multiple such REST APIs in Tika through a
>>>> property file and then let the TextFeatureParser collate and present
>>>> all the results.
>>>>
>>>> Let me know what you think about it. [1] has my code for
>>>> TextFeatureParser; I will be submitting a PR soon.
>>>>
>>>> CCing Indhu for any questions regarding [0]
>>>>
>>>> [0] https://github.com/USCDataScience/Age-Predictor
>>>> [1] https://github.com/smadha/tika/tree/TIKA-1988
>>>>
>>>>
>>>> --
>>>> Madhav Sharan
>>>>
>>>
>>>
>>
>
