Hi,

OK, thank you for the explanation and for integrating the contribution. I did not know that it was already part of the 2.10 release. I have submitted a patch that combines the first patch and the new code to the JIRA issue CONNECTORS-1500. It is a diff against the html-extractor connector.
The documentation is here: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector

If you want to integrate at least the user documentation into the official MCF site, no problem. Without it, I think it will be hard for users to understand the goal of this connector!

Best regards,

Olivier TAVARD

> On 5 May 2018 at 14:02, Piergiorgio Lucidi <[email protected]> wrote:
>
> Hi,
>
> I have just updated the CHANGES.txt adding CONNECTORS-1500 included in the
> 2.10 release with a mention to Olivier.
>
> Olivier, thank you so much for your contribution.
>
> We should find a good way to also create a test suite for this new
> connector.
>
> Cheers,
> PJ
>
> 2018-05-05 11:57 GMT+02:00 Karl Wright <[email protected]>:
>
>> Hi Olivier,
>>
>> This was actually already committed. But it was renamed as the
>> html-extractor connector, not "datafari", which didn't mean anything to me.
>>
>> Any changes you want to make should therefore be supplied as a diff against
>> the html-extractor connector.
>>
>> Sorry for the confusion!!
>>
>> Karl
>>
>>
>> On Fri, May 4, 2018 at 4:28 PM Karl Wright <[email protected]> wrote:
>>
>>> Yes, please do update the patch. I'm sorry I did not get to this; many
>>> other things intruded. I created the branch but did not apply the original
>>> patch onto it, so please supply a whole new patch.
>>>
>>> Karl
>>>
>>>
>>> On Fri, May 4, 2018 at 11:28 AM Olivier Tavard <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I wanted to know if the code remains interesting for the MCF community.
>>>> I updated it since the initial release so please tell me if I need to
>>>> submit a new patch into the issue already created:
>>>> https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1500
>>>>
>>>> Thanks,
>>>> Best regards,
>>>>
>>>> Olivier TAVARD
>>>>
>>>>
>>>>> On 15 March 2018 at 15:58, Karl Wright <[email protected]> wrote:
>>>>>
>>>>> Excellent!!
>>>>>
>>>>> Thank you again. I'll try to set up the branch this weekend.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Mar 15, 2018 at 10:52 AM, Olivier Tavard <[email protected]> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> Sure thing, I created a ticket: https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1500
>>>>>> with the code in attachment.
>>>>>> No specific libraries used, just JSOUP library that is already in the MCF
>>>>>> core project.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Olivier
>>>>>>
>>>>>>
>>>>>>> On 15 March 2018 at 11:51, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Olivier,
>>>>>>>
>>>>>>> Thank you very much for your contribution!
>>>>>>>
>>>>>>> To have a legal trail, I usually prefer the following approach --
>>>>>>>
>>>>>>> (1) Create a ticket
>>>>>>> (2) Attach a diff to the ticket
>>>>>>>
>>>>>>> We'll then integrate the diff into a branch, and then finally into trunk.
>>>>>>>
>>>>>>> Can you also let us know what kinds of dependent jars the contribution
>>>>>>> has? We'd need to know about not only direct dependencies, but also any
>>>>>>> downstream dependencies that may be incompatible with the Apache License.
>>>>>>> Usually we can figure this out but it saves time to know in advance if
>>>>>>> there are LGPL dependencies (for instance).
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 15, 2018 at 6:35 AM, Olivier Tavard <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello MCF community,
>>>>>>>>
>>>>>>>> I developed a transformation connector based on Jsoup. The goal of this
>>>>>>>> code is simply to choose an encompassing tag in an HTML document for text
>>>>>>>> extraction. Inside this tag, the connector allows you to remove
>>>>>>>> subparts that you do not want: all the tags corresponding to declared types
>>>>>>>> or specific attribute tag names, for example.
>>>>>>>> I would like to know if it could interest you. The code is under the Apache V2
>>>>>>>> licence and I integrated it in our enterprise search solution (Datafari).
>>>>>>>> This morning I integrated the code in a fork MCF project on GitHub.
>>>>>>>> Obviously it needs some work including code refactoring, renaming classes,
>>>>>>>> unit tests that I will be able to do if you are interested in the code.
>>>>>>>> The code is here: https://github.com/otavard/manifoldcf/tree/htmlextractorconnector
>>>>>>>> <https://github.com/otavard/manifoldcf/commits/htmlextractorconnector>
>>>>>>>> And the documentation here: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Olivier TAVARD
>
> --
> Piergiorgio Lucidi
> https://www.open4dev.com
