Yeah, agreed. I saw your project and liked the way you created the binary and quad age groups. *Indhu* can share more details on the linear regression approach and its accuracy. As far as I know, it's a bigram model based on the top 10k features.

This is what the Tika CLI response looks like:

Content-Length: 6954
Content-Type: application/xml
*Estimated-Author-Age: 23*
*Estimated-Author-Age-Range: 18-28*
X-Parsed-By: org.apache.tika.parser.CompositeParser
X-Parsed-By: org.apache.tika.parser.nlp.classifier.TextFeatureParser
resourceName: pom.xml

I was thinking of adding more metadata fields from different approaches to the same response. For example, we can add a new field, "*Estimated-Author-Age-Binary-Group*", to this. We can run multiple REST API calls in parallel and enable or disable each one through a property file. Basically, the user defines which APIs to run, and Tika combines all the results.

Thanks
--
Madhav Sharan

On Tue, Jun 14, 2016 at 12:51 AM, Anthony Beylerian <[email protected]> wrote:

> Hi Madhav,
>
> Thank you for sharing, yes maybe it's possible.
>
> Although there is overlap, the two approaches are a bit different.
>
> Do you have some documentation on the performance of the linear regression
> approach?
>
> I'm not sure how well it would perform for gender (binary) and other
> attributes.
>
> Ideally it would be desirable to have a way to capture all traits with
> reasonable performance.
>
> Best,
>
> Anthony
>
>
> On Tue, Jun 14, 2016 at 8:46 AM, Madhav Sharan <[email protected]> wrote:
>
>> Hi Anthony, age prediction part of this enhancement looks very similar to
>> https://issues.apache.org/jira/browse/TIKA-1988
>>
>> Do you see any way we can collaborate on this feature? I was thinking to
>> build a TextFeatureParser which can parse multiple text based features
>> like age.
>>
>> In our project for age prediction we built a classifier using linear
>> regression which is available through a REST API ( more details in [0] ).
>> We can configure multiple such REST APIs in TIKA through property file and
>> then let the TextFeatureParser collate and present all the results.
>>
>> Let me know what you think about it. [1] has my code for
>> TextFeatureParser, I will be giving a PR soon.
>>
>> CCing Indhu for any questions regarding [0]
>>
>> [0] https://github.com/USCDataScience/Age-Predictor
>> [1] https://github.com/smadha/tika/tree/TIKA-1988
>>
>> --
>> Madhav Sharan
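
P.S. Here is a rough sketch of the collation idea from the top of this thread: run the enabled REST-backed estimators in parallel and merge their fields into one response. It is plain Python, not Tika parser code, and everything in it is an assumption made for illustration: the `<name>.enabled` property keys, the function names, and the stub estimators, which return fixed values instead of calling a real REST endpoint. The age values mirror the sample response above, and the binary-group value is only a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def age_regression(text):
    # Stub standing in for a REST call to the linear regression service.
    return {"Estimated-Author-Age": "23", "Estimated-Author-Age-Range": "18-28"}


def age_binary_group(text):
    # Stub standing in for a second, hypothetical REST-backed estimator.
    return {"Estimated-Author-Age-Binary-Group": "placeholder"}


# Hypothetical registry: estimator name -> callable taking the text.
ESTIMATORS = {
    "age_regression": age_regression,
    "age_binary_group": age_binary_group,
}


def enabled_estimators(props, estimators):
    """Keep only estimators whose '<name>.enabled' property is 'true'."""
    return {
        name: fn
        for name, fn in estimators.items()
        if props.get(f"{name}.enabled", "false").lower() == "true"
    }


def collate_metadata(text, props, estimators):
    """Run the enabled estimators in parallel and merge their fields."""
    active = enabled_estimators(props, estimators)
    metadata = {}
    if not active:
        return metadata
    with ThreadPoolExecutor(max_workers=len(active)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in active.items()}
        for name, future in futures.items():
            try:
                metadata.update(future.result())
            except Exception:
                # A failing estimator is skipped so the others still report.
                continue
    return metadata


if __name__ == "__main__":
    props = {"age_regression.enabled": "true", "age_binary_group.enabled": "true"}
    for key, value in collate_metadata("sample text", props, ESTIMATORS).items():
        print(f"{key}: {value}")
```

Keeping each estimator behind its own property means a new approach can be added or switched off without touching the collation step.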
