James,
Don't you think that spawning DIH 2.0 as a separate WAR is a priority?
On Wed, Jun 11, 2014 at 6:39 PM, Dyer, James <[email protected]> wrote:
> Alexandre,
>
> I think that writing a new entity processor for DIH is a much less risky
> thing to commit than, say, SOLR-4799. Entity Processors work as plug-ins
> and they aren't likely to break anything else. So a Morphline
> EntityProcessor is much more likely to be evaluated and committed.
>
> But like anything else, you're going to need to explain what the need is
> and what this new e.p. buys the user community. There needs to be unit
> tests, etc.
>
> Besides this, if you can show how a morphline e.p. can be a step towards
> migrating away from DIH entirely, then that would be a plus. Perhaps
> create a new solr example along the lines of the dih solr example that
> demonstrates to users this new way forward. This would go a long way in
> convincing the community we have a viable alternative to dih.
>
> James Dyer
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:[email protected]]
> Sent: Tuesday, June 10, 2014 9:55 PM
> To: [email protected]
> Subject: Re: Adding Morphline support to DIH - worth the effort?
>
> Ripples in the pond again. Spreading and dying. Understandable, but
> still somewhat annoying.
>
> So, what would be the minimal viable next step to move this
> conversation forward? Something for 4.11 as opposed to 5.0?
>
> Anyone with commit status has a feeling of what - minimal -
> deliverable they would put their own weight behind?
>
> Regards,
> Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Mon, Jun 9, 2014 at 10:50 AM, [email protected]
> <[email protected]> wrote:
> >> One of the ideas over DIH discussed earlier is making it standalone.
> >
> > Yeah; my beef with the DIH is that it’s tied to Solr. But I’d rather see
> > something other than the DIH outside Solr; it’s not worthy IMO.
> > Why have something Solr specific even? A great pipeline shouldn’t tie itself to any
> > end-point. There are a variety of solutions out there that I tried. There
> > are the big 3 open-source ETLs: Kettle, Clover, Talend) and they aren’t
> > quite ideal in one way or another. And Spring-Integration. And some
> > half-baked data pipelines like OpenPipe & Open Pipeline. I never got around
> > to taking a good look at Findwise’s open-sourced Hydra but I learned enough
> > to know to my surprise it was configured in code versus a config file (like
> > all the others) and that's a big turn-off to me. Today I read through most
> > of the Morphlines docs and a few choice source files and I’m
> > super-impressed. But as you note it’s missing a lot of other stuff. I
> > think something great could be built using it as a core piece.
> >
> > ~ David Smiley
> > Freelance Apache Lucene/Solr Search Consultant/Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Sun, Jun 8, 2014 at 5:51 PM, Mikhail Khludnev
> > <[email protected]> wrote:
> >>
> >> Jack,
> >> I found your considerations quite reasonable.
> >> One of the ideas over DIH discussed earlier is making it standalone. So,
> >> if we start from simple Morphline UI, we can do this extraction. Then, such
> >> externalized ETL, will work better with Solr Cloud than DIH works now.
> >> Presumably we can reuse DIH Jdbc Datasources as a source for Morphline
> >> records.
> >> Still open questions in this approach are:
> >> - joins/caching - seem possible with Morphlines but still there is no such
> >> command
> >> - delta import - scenario we don't need to forget to handle it
> >> - threads (it's completely out Morphline's concerns)
> >> - distributed processing - it would be great if we can partition
> >> datasource eg something what's done by Scoop
> >> ... what else?
> >>
> >>
> >> On Sun, Jun 8, 2014 at 6:54 PM, Jack Krupansky <[email protected]>
> >> wrote:
> >>>
> >>> I've avoided DIH like the plague since it really doesn't fit well in
> >>> Solr, so I'm still baffled as to why you think we need to use DIH as the
> >>> foundation for a Solr Morphlines project. That shouldn't stop you, but
> >>> what's the big impediment to taking a clean slate approach to Morphlines -
> >>> learn what we can from DIH, but do a fresh, clean "Solr 5.0" implementation
> >>> that is not burdened from the get-go with all of DIH's baggage?
> >>>
> >>> Configuring DIH is one of its main problems, so blending Morphlines
> >>> config into DIH config would seem to just make Morphlines less attractive
> >>> than it actually is when viewed by itself.
> >>>
> >>> You might also consider how ManifoldCF (another Apache project) would
> >>> integrate with DIH and Morphlines as well. I mean, the core use case is ETL
> >>> from external data sources. And how all of this relates to Apache Flume as
> >>> well.
> >>>
> >>> But back to the original, still unanswered, question: Why use DIH as the
> >>> starting point for integrating Morphlines with Solr - unless the goal is to
> >>> make Morphlines unpalatable and less approachable than even DIH itself?!
> >>>
> >>> Another question: What does Elasticsearch have in this area (besides
> >>> "rivers")? Are they headed in the Morphlines direction as well?
> >>>
> >>>
> >>> -- Jack Krupansky
> >>>
> >>> -----Original Message----- From: Alexandre Rafalovitch
> >>> Sent: Sunday, June 8, 2014 10:16 AM
> >>>
> >>> To: [email protected]
> >>> Subject: Re: Adding Morphline support to DIH - worth the effort?
> >>>
> >>> I see DIH as something that offers a quick way to get things done, as
> >>> long as they fit into DIH's couple of basic scenarios. Going even a
> >>> little beyond hits bugs, bad documentation, inconsistencies and lack
> >>> of ongoing support (e.g. SOLR-4383).
> >>>
> >>> So, if it works for you - great. If it does not - too bad, use SolrJ.
> >>> And given what I observe, I believe the next round of improvements
> >>> might be easier to achieve by moving to a different open-source pipe
> >>> project than trying to keep reinventing and bandaging one of our own.
> >>> Go where strongest community is, etc.
> >>>
> >>> Morphline can be seen as a replacement for DIH's EntityProcessors and
> >>> Transformers (Flume adds other bits). The reasons I think it is worth
> >>> looking at are as follows:
> >>> 1) DIH is not really being maintained or further improved. So, the
> >>> list of EP and Transformers is the same and does not account for new
> >>> requests (which we see periodically on the mailing list); even the new
> >>> implementations get stuck in JIRA (see the JIRA in original email)
> >>> 2) It's not terribly well documented either, so people are always
> >>> struggling to understand how the entity is actually generated and what
> >>> happens when things go wrong
> >>> 3) We are already bundling Morphline jars with Solr. But we are NOT
> >>> using them in any way useful to a non-Hadoop Solr user. Which begs the
> >>> question why did we add them (one answer I guess: because we don't
> >>> have module system).
> >>> 4) Morphlines have more primitives than DIH and the available list keeps
> >>> growing
> >>> 5) What separate module for Solr? We have no discovery method for
> >>> modules. Writing one for general consumption is like trying to sing in
> >>> vacuum - the problem is a lot bigger that with individual offering.
> >>>
> >>> In terms of implementation, I think it take defining a custom
> >>> MorphlineEntityProcessor which basically plugs into DIH's current
> >>> DataSources. So, one could use for example DIH SqlDataSource to get a
> >>> list of files and then to handoff to Morphline's black box to parse
> >>> those files into records (e.g. Multiline records), augment them, etc.
> >>> Then, at the end, this gets handed back to DIH to finish it up. I
> >>> think this would work even with nested entities and transformers. The
> >>> Admin UI should also work
> >>>
> >>> Eventually, I think we need a harder discussion about DIH, so this
> >>> partial handover could be a way to test the waters.
> >>>
> >>> Does this make more sense?
> >>>
> >>> Regards,
> >>> Alex.
> >>> Personal website: http://www.outerthoughts.com/
> >>> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >>> proficiency
> >>>
> >>>
> >>> On Sun, Jun 8, 2014 at 8:41 PM, Jack Krupansky <[email protected]>
> >>> wrote:
> >>>>
> >>>> It sounds more like an alternative to DIH rather than an incremental add-on
> >>>> to DIH. I mean, isn't Morphline really just "a DIH for Hadoop"?
> >>>>
> >>>> So, back to Shalin's question, which specific (please detail!) use cases of
> >>>> DIH are enhanced by Morphline?
> >>>>
> >>>> Maybe it would help if you simply elaborate what benefits would accrue to
> >>>> adding Morphline to DIH - as opposed to creating a separate module for Solr.
> >>>> I suppose it depends on whether you consider DIH a solid foundation or a
> >>>> weak link in Solr that desperately needs firming up.
> >>>>
> >>>> -- Jack Krupansky
> >>>>
> >>>> -----Original Message----- From: Alexandre Rafalovitch
> >>>> Sent: Sunday, June 8, 2014 1:40 AM
> >>>> To: [email protected]
> >>>> Subject: Re: Adding Morphline support to DIH - worth the effort?
> >>>>
> >>>>
> >>>> Well, it's the same core scenario as DIH supports (apart from actual
> >>>> data sources), but actively supported and developed by a company with
> >>>> a lot more investment in it.
> >>>> For the primitives supported, see
> >>>>
> >>>> http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html
> >>>>
> >>>> We don't bundle ALL of these with Solr, but I think we do bundle core,
> >>>> solr-core and solr-cell packages, which is a good number and range of
> >>>> functionality (e.g. readMultiLine).
> >>>>
> >>>> Regards,
> >>>> Alex.
> >>>> Personal website: http://www.outerthoughts.com/
> >>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >>>> proficiency
> >>>>
> >>>>
> >>>> On Sun, Jun 8, 2014 at 12:23 PM, Shalin Shekhar Mangar
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> I do not know much about morphlines but I'd like to know what use-cases
> >>>>> would be possible/easier/faster with such an integration?
> >>>>>
> >>>>>
> >>>>> On Sun, Jun 8, 2014 at 10:32 AM, Alexandre Rafalovitch
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I had a preliminary look around and it might be possible to plug
> >>>>>> Morphline (already shipped with Solr) into DIH by creating a bridging
> >>>>>> EntityProcessor.
> >>>>>>
> >>>>>> Two questions:
> >>>>>> 1) Do people see value in it?
> >>>>>> 2) DIH is not very supported, so any addition seems to be a bit stuck
> >>>>>> in "rickety bridge, don't rock" discussion (e.g. SOLR-4799). I don't
> >>>>>> want to suddenly be responsible for fixing the bridge before adding a
> >>>>>> standalone piece of code. So, if I write the code, how many general
> >>>>>> DIH externalities would I also have to address (e.g. lack of tests,
> >>>>>> etc)?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Alex.
> >>>>>> P.s. Morphline could also be integrated in update request processor
> >>>>>> chain. So, that could be an alternative project.
> >>>>>>
> >>>>>> Personal website: http://www.outerthoughts.com/
> >>>>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >>>>>> proficiency
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>> For additional commands, e-mail: [email protected]
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Regards,
> >>>>> Shalin Shekhar Mangar.
> >>>>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> Principal Engineer,
> >> Grid Dynamics
> >>
>

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<[email protected]>
