From our perspective we don’t really see use cases for DIH anymore.

Morphlines was developed primarily with Lucene in mind (even though it doesn’t 
require Lucene).

Flume Morphline Solr Sink handles streaming ingestion into Solr in reliable, 
scalable, flexible and loosely coupled ways, in separate processes. Neither 
Flume nor Morphlines requires Hadoop.

MapReduceIndexerTool uses Morphlines for reliable, scalable and flexible batch 
ingestion on Hadoop.

On Hadoop, even the JDBC/SQL portion of DIH now seems mostly covered by a 
combination of Sqoop and MapReduceIndexerTool, and perhaps a bit of Hive.

I’m not sure what the use cases for DIH still are these days.

(I wrote most of the Morphlines framework, Flume Morphline Solr Sink, 
MapReduceIndexerTool and the hbase-indexer-morphline integration.)

Just my 0.02c,
Wolfgang.

On Jun 11, 2014, at 1:05 PM, Dyer, James <[email protected]> wrote:

> Mikhail,
>  
> It would be nice if the DIH could be run separately from Solr (SOLR-853 and 
> others).  I think a lot of us have already expressed support for this, and at 
> one time I was looking into what it would take to complete.  Then again, 
> having watched the solr morphline sink be created for Flume, I realized there 
> are other teams out there possibly building an awesome DIH killer.  If that 
> happens, then we just saved ourselves a boatload of work, right?  I think if 
> someone out there can create a nice POC that uses a different tool, that 
> would be a great first step.
>  
> But there is also SOLR-3671 which was just committed as a follow-on to 
> SOLR-2382.  This makes DIH able to send documents to places other than Solr.  
> Turns out someone here is using DIH to import to Mongo.  (See SOLR-5981 for 
> details).  So we already have one side of the functionality to generalize DIH.
>  
> James Dyer
> Ingram Content Group
> (615) 213-4311
>  
> From: Mikhail Khludnev [mailto:[email protected]] 
> Sent: Wednesday, June 11, 2014 11:56 AM
> To: [email protected]
> Subject: Re: Adding Morphline support to DIH - worth the effort?
>  
> James,
> Don't you think that spawning DIH 2.0 as a separate WAR is a priority?
>  
> 
> On Wed, Jun 11, 2014 at 6:39 PM, Dyer, James <[email protected]> 
> wrote:
> Alexandre,
> 
> I think that writing a new entity processor for DIH is a much less risky 
> thing to commit than, say, SOLR-4799.  Entity Processors work as plug-ins and 
> they aren't likely to break anything else.  So a Morphline EntityProcessor is 
> much more likely to be evaluated and committed.
> 
> But like anything else, you're going to need to explain what the need is and 
> what this new e.p. buys the user community.   There needs to be unit tests, 
> etc.
> 
> Besides this, if you can show how a morphline e.p. can be a step towards 
> migrating away from DIH entirely, then that would be a plus.  Perhaps create 
> a new solr example along the lines of the dih solr example that demonstrates 
> to users this new way forward.  This would go a long way in convincing the 
> community we have a viable alternative to dih.
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311
> 
> 
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:[email protected]]
> Sent: Tuesday, June 10, 2014 9:55 PM
> To: [email protected]
> Subject: Re: Adding Morphline support to DIH - worth the effort?
> 
> Ripples in the pond again. Spreading and dying. Understandable, but
> still somewhat annoying.
> 
> So, what would be the minimal viable next step to move this
> conversation forward? Something for 4.11 as opposed to 5.0?
> 
> Does anyone with commit status have a feeling for what minimal
> deliverable they would put their own weight behind?
> 
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr 
> proficiency
> 
> 
> On Mon, Jun 9, 2014 at 10:50 AM, [email protected]
> <[email protected]> wrote:
> >> One of the ideas over DIH discussed earlier is making it standalone.
> >
> > Yeah; my beef with the DIH is that it’s tied to Solr.  But I’d rather see
> > something other than the DIH outside Solr; it’s not worthy IMO.  Why have
> > something Solr specific even?  A great pipeline shouldn’t tie itself to any
> > end-point.  There are a variety of solutions out there that I tried.  There
> > are the big 3 open-source ETLs (Kettle, Clover, Talend) and they aren’t
> > quite ideal in one way or another.  And Spring-Integration.  And some
> > half-baked data pipelines like OpenPipe & Open Pipeline.  I never got around
> > to taking a good look at Findwise’s open-sourced Hydra but I learned enough
> > to know to my surprise it was configured in code versus a config file (like
> > all the others) and that's a big turn-off to me.  Today I read through most
> > of the Morphlines docs and a few choice source files and I’m
> > super-impressed.  But as you note it’s missing a lot of other stuff.  I
> > think something great could be built using it as a core piece.
> >
> > ~ David Smiley
> > Freelance Apache Lucene/Solr Search Consultant/Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Sun, Jun 8, 2014 at 5:51 PM, Mikhail Khludnev
> > <[email protected]> wrote:
> >>
> >> Jack,
> >> I found your considerations quite reasonable.
> >> One of the ideas over DIH discussed earlier is making it standalone. So,
> >> if we start from a simple Morphline UI, we can do this extraction. Then such
> >> an externalized ETL will work better with SolrCloud than DIH does now.
> >> Presumably we can reuse DIH's JDBC data sources as a source for Morphline
> >> records.
> >> The open questions in this approach are:
> >> - joins/caching - seem possible with Morphlines, but there is no such
> >> command yet
> >> - delta import - a scenario we must not forget to handle
> >> - threads (completely outside Morphlines' concerns)
> >> - distributed processing - it would be great if we could partition the
> >> data source, e.g. something like what Sqoop does
> >> ... what else?
> >>
> >>
> >> On Sun, Jun 8, 2014 at 6:54 PM, Jack Krupansky <[email protected]>
> >> wrote:
> >>>
> >>> I've avoided DIH like the plague since it really doesn't fit well in
> >>> Solr, so I'm still baffled as to why you think we need to use DIH as the
> >>> foundation for a Solr Morphlines project. That shouldn't stop you, but
> >>> what's the big impediment to taking a clean slate approach to Morphlines -
> >>> learn what we can from DIH, but do a fresh, clean "Solr 5.0" implementation
> >>> that is not burdened from the get-go with all of DIH's baggage?
> >>>
> >>> Configuring DIH is one of its main problems, so blending Morphlines
> >>> config into DIH config would seem to just make Morphlines less attractive
> >>> than it actually is when viewed by itself.
> >>>
> >>> You might also consider how ManifoldCF (another Apache project) would
> >>> integrate with DIH and Morphlines as well. I mean, the core use case is ETL
> >>> from external data sources. And how all of this relates to Apache Flume as
> >>> well.
> >>>
> >>> But back to the original, still unanswered, question: Why use DIH as the
> >>> starting point for integrating Morphlines with Solr - unless the goal is to
> >>> make Morphlines unpalatable and less approachable than even DIH itself?!
> >>>
> >>> Another question: What does Elasticsearch have in this area (besides
> >>> "rivers")? Are they headed in the Morphlines direction as well?
> >>>
> >>>
> >>> -- Jack Krupansky
> >>>
> >>> -----Original Message----- From: Alexandre Rafalovitch
> >>> Sent: Sunday, June 8, 2014 10:16 AM
> >>>
> >>> To: [email protected]
> >>> Subject: Re: Adding Morphline support to DIH - worth the effort?
> >>>
> >>> I see DIH as something that offers a quick way to get things done, as
> >>> long as they fit into DIH's couple of basic scenarios. Going even a
> >>> little beyond hits bugs, bad documentation, inconsistencies and lack
> >>> of ongoing support (e.g. SOLR-4383).
> >>>
> >>> So, if it works for you - great. If it does not - too bad, use SolrJ.
> >>> And given what I observe, I believe the next round of improvements
> >>> might be easier to achieve by moving to a different open-source pipe
> >>> project than trying to keep reinventing and bandaging one of our own.
> >>> Go where the strongest community is, etc.
> >>>
> >>> Morphline can be seen as a replacement for DIH's EntityProcessors and
> >>> Transformers (Flume adds other bits). The reasons I think it is worth
> >>> looking at are as follows:
> >>> 1) DIH is not really being maintained or further improved. So, the
> >>> list of EP and Transformers is the same and does not account for new
> >>> requests (which we see periodically on the mailing list); even the new
> >>> implementations get stuck in JIRA (see the JIRA in original email)
> >>> 2) It's not terribly well documented either, so people are always
> >>> struggling to understand how the entity is actually generated and what
> >>> happens when things go wrong
> >>> 3) We are already bundling Morphline jars with Solr. But we are NOT
> >>> using them in any way useful to a non-Hadoop Solr user. Which begs the
> >>> question of why we added them (one answer, I guess: because we don't
> >>> have a module system).
> >>> 4) Morphlines have more primitives than DIH and the available list keeps
> >>> growing
> >>> 5) What separate module for Solr? We have no discovery method for
> >>> modules. Writing one for general consumption is like trying to sing in
> >>> a vacuum - the problem is a lot bigger than with an individual offering.
> >>>
> >>> In terms of implementation, I think it would take defining a custom
> >>> MorphlineEntityProcessor which basically plugs into DIH's current
> >>> DataSources. So, one could use, for example, DIH SqlDataSource to get a
> >>> list of files and then hand off to Morphline's black box to parse
> >>> those files into records (e.g. multiline records), augment them, etc.
> >>> Then, at the end, this gets handed back to DIH to finish it up. I
> >>> think this would work even with nested entities and transformers. The
> >>> Admin UI should also work.
> >>>
> >>> Eventually, I think we need a harder discussion about DIH, so this
> >>> partial handover could be a way to test the waters.
> >>>
> >>> Does this make more sense?
> >>>
> >>> Regards,
> >>>   Alex.
> >>> Personal website: http://www.outerthoughts.com/
> >>> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >>> proficiency
> >>>
> >>>
> >>> On Sun, Jun 8, 2014 at 8:41 PM, Jack Krupansky <[email protected]>
> >>> wrote:
> >>>>
> >>>> It sounds more like an alternative to DIH rather than an incremental add-on
> >>>> to DIH. I mean, isn't Morphline really just "a DIH for Hadoop"?
> >>>>
> >>>> So, back to Shalin's question, which specific (please detail!) use cases of
> >>>> DIH are enhanced by Morphline?
> >>>>
> >>>> Maybe it would help if you simply elaborate what benefits would accrue to
> >>>> adding Morphline to DIH - as opposed to creating a separate module for Solr.
> >>>> I suppose it depends on whether you consider DIH a solid foundation or a
> >>>> weak link in Solr that desperately needs firming up.
> >>>>
> >>>> -- Jack Krupansky
> >>>>
> >>>> -----Original Message----- From: Alexandre Rafalovitch
> >>>> Sent: Sunday, June 8, 2014 1:40 AM
> >>>> To: [email protected]
> >>>> Subject: Re: Adding Morphline support to DIH - worth the effort?
> >>>>
> >>>>
> >>>> Well, it's the same core scenario as DIH supports (apart from actual
> >>>> data sources), but actively supported and developed by a company with
> >>>> a lot more investment in it. For the primitives supported, see
> >>>>
> >>>> http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html
> >>>>
> >>>> We don't bundle ALL of these with Solr, but I think we do bundle core,
> >>>> solr-core and solr-cell packages, which is a good number and range of
> >>>> functionality (e.g. readMultiLine).
> >>>>
> >>>> Regards,
> >>>>   Alex.
> >>>> Personal website: http://www.outerthoughts.com/
> >>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >>>> proficiency
> >>>>
> >>>>
> >>>> On Sun, Jun 8, 2014 at 12:23 PM, Shalin Shekhar Mangar
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> I do not know much about morphlines but I'd like to know what use-cases
> >>>>> would be possible/easier/faster with such an integration?
> >>>>>
> >>>>>
> >>>>> On Sun, Jun 8, 2014 at 10:32 AM, Alexandre Rafalovitch
> >>>>> <[email protected]>
> >>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I had a preliminary look around and it might be possible to plug
> >>>>>> Morphline (already shipped with Solr) into DIH by creating a bridging
> >>>>>> EntityProcessor.
> >>>>>>
> >>>>>> Two questions:
> >>>>>> 1) Do people see value in it?
> >>>>>> 2) DIH is not well supported, so any addition seems to get stuck in a
> >>>>>> "rickety bridge, don't rock" discussion (e.g. SOLR-4799). I don't
> >>>>>> want to suddenly be responsible for fixing the bridge before adding a
> >>>>>> standalone piece of code. So, if I write the code, how many general
> >>>>>> DIH externalities would I also have to address (e.g. lack of tests,
> >>>>>> etc)?
> >>>>>>
> >>>>>> Regards,
> >>>>>>    Alex.
> >>>>>> P.S. Morphline could also be integrated into the update request processor
> >>>>>> chain. So, that could be an alternative project.
> >>>>>>
> >>>>>> Personal website: http://www.outerthoughts.com/
> >>>>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >>>>>> proficiency
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>> For additional commands, e-mail: [email protected]
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Regards,
> >>>>> Shalin Shekhar Mangar.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> Principal Engineer,
> >> Grid Dynamics
> >>
> >>
> >
> 
> 
> 
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics

