Hello MCF community,

I developed a transformation connector based on Jsoup. The goal of this code is
simply to choose an enclosing tag in an HTML document for text extraction.
Inside this tag, the connector lets you remove the subparts that you do not
want: for example, all tags matching declared types or specific attribute
names.
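
To illustrate the idea only, here is a minimal sketch of the "keep one
enclosing tag, drop unwanted subparts" behaviour. It is not the connector's
code: the connector is written in Java on top of Jsoup, while this sketch uses
Python's standard html.parser so that it runs without dependencies. The names
EnclosingTagExtractor, extract_text, keep_tag, drop_tags and drop_classes are
hypothetical, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EnclosingTagExtractor(HTMLParser):
    """Collect text inside the first `keep_tag`, skipping excluded subtrees."""

    def __init__(self, keep_tag, drop_tags=(), drop_classes=()):
        super().__init__(convert_charrefs=True)
        self.keep_tag = keep_tag
        self.drop_tags = set(drop_tags)
        self.drop_classes = set(drop_classes)
        self.keep_depth = 0   # > 0 while inside the enclosing tag
        self.drop_depth = 0   # > 0 while inside an excluded subtree
        self.done = False     # True once the enclosing tag has been closed
        self.chunks = []

    def _is_dropped(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag in self.drop_tags or bool(self.drop_classes.intersection(classes))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.keep_depth == 0:
            # Still outside: only the enclosing tag itself is interesting.
            if tag == self.keep_tag:
                self.keep_depth = 1
            return
        self.keep_depth += 1
        if self.drop_depth > 0 or self._is_dropped(tag, attrs):
            self.drop_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.done or self.keep_depth == 0:
            return
        self.keep_depth -= 1
        if self.drop_depth > 0:
            self.drop_depth -= 1
        if self.keep_depth == 0:
            self.done = True  # only the first matching enclosing tag is used

    def handle_data(self, data):
        if self.keep_depth > 0 and self.drop_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, keep_tag, drop_tags=(), drop_classes=()):
    """Return the text of the first `keep_tag`, minus the excluded subparts."""
    parser = EnclosingTagExtractor(keep_tag, drop_tags, drop_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, calling extract_text(page, "article", drop_tags=["script"],
drop_classes=["ads"]) keeps the text of the first article element and ignores
any script element or element with the class "ads" inside it.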
I would like to know whether this would interest you. The code is under the
Apache License 2.0 and I have integrated it into our enterprise search solution
(Datafari). This morning I added the code to a fork of the MCF project on
GitHub. Obviously it needs some work, including code refactoring, renaming
classes, and unit tests, which I will be able to do if you are interested in
the code.
The code is here:
https://github.com/otavard/manifoldcf/tree/htmlextractorconnector 
<https://github.com/otavard/manifoldcf/commits/htmlextractorconnector>
And the documentation is here:
https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector

Best regards,

Olivier TAVARD

