Hello Matt,

Thank you so much for your detailed reply, I really appreciate it!
Yes, in the future we might have something in the content that can decide whether there will be a new partition. As of now, however, it is simply the date, so whenever the date changes, it should be triggered. In the present scenario, whenever the date changes, PutHDFS automatically creates a new directory and puts the data there. But can we have logic that identifies this date change and triggers a new processor to run?

Thank you!

______________________
*Kind Regards,*
*Anshuman Ghosh*
*Contact - +49 179 9090964*

On Thu, Mar 23, 2017 at 4:43 PM, Matt Burgess <[email protected]> wrote:

> Anshuman,
>
> For #1, is there a way from the content of a file destined for HDFS
> that you can tell whether a new partition will be introduced? If so,
> then after PutHDFS, you could make that decision (with a
> RouteOnContent or ExtractText->RouteOnAttribute), and for new
> partitions, you can route them to an ExecuteStreamCommand where you
> ignore STDIN and instead issue your hive repair command. This will
> make things more event-driven rather than having to periodically try a
> repair from a separate part of the flow.
>
> For #2, we chose a separate set of Hive processors (rather than adding
> support for the Hive driver to the generic SQL processors) for at
> least two reasons. First, we wanted to include the Hive driver so the
> admin/user did not need to provide their own. The Hive dependencies
> are quite large and don't really belong in the standard NAR, so that
> lent itself to the need for a new NAR. Another reason is that the Hive
> JDBC driver does not support some methods that were deemed important
> and necessary for processors like ExecuteSQL, such as
> setQueryTimeout(). Rather than ignoring the provided value, the Hive
> driver throws a SQLException. Discussion in the community indicated
> that it was better (along with the first reason) to have separate
> processors rather than making the SQL processors behave inconsistently
> based on the level of driver support.
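The date-change check Anshuman asks about (and that Matt suggests doing with ExtractText -> RouteOnAttribute) can be sketched as a small piece of state-comparison logic. This is only an illustration of the routing decision, not NiFi itself; the state-file handling and the `new-partition`/`same-partition` labels are made up for this sketch.

```shell
# Illustrative sketch of the "did the partition date change?" decision.
# In the real flow, the date would be an attribute extracted from the
# flow file and the routing would be done by RouteOnAttribute.
STATE_FILE="$(mktemp)"            # stands in for persisted state

check_new_partition() {
    today="$1"                    # date extracted from the incoming data
    last="$(cat "$STATE_FILE" 2>/dev/null)"
    if [ "$today" != "$last" ]; then
        printf '%s' "$today" > "$STATE_FILE"
        echo "new-partition"      # route to the repair branch
    else
        echo "same-partition"     # route onward unchanged
    fi
}

check_new_partition 2017-03-22    # prints: new-partition
check_new_partition 2017-03-22    # prints: same-partition
check_new_partition 2017-03-23    # prints: new-partition
```

The point of the sketch is that only the first file of a new date takes the repair branch; everything else flows through untouched, which is what makes Matt's approach event-driven.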
> Having said that, there is at least one Pull Request out there [1]
> that does exactly that, so perhaps it is time to revisit the
> discussion to see if there has been any change in opinions.
>
> Regards,
> Matt
>
> [1] https://github.com/apache/nifi/pull/1281
>
> On Thu, Mar 23, 2017 at 5:47 AM, Anshuman Ghosh
> <[email protected]> wrote:
> > Hello Matt,
> >
> > Thank you for your reply!
> >
> > With "ExecuteProcess", I am able to execute a command (Hive/Beeline).
> > Actually our use case is relatively simple, so if you have any other
> > suggestions that would be helpful.
> >
> > We are writing to a HDFS directory which is the location for an
> > external Hive table. However, when we write/introduce a new partition,
> > we need to execute a repair in order to update the metadata. I was
> > wondering if there is any better way to do this (apart from executing
> > the command separately through a processor, where we are also not sure
> > about the frequency of execution) which you are aware of.
> > Can't we use the generic JDBC driver to connect to Hive and execute
> > commands like we do for any other database (like we did for PostgreSQL)?
> >
> > Thanking you in advance!
> >
> > ______________________
> >
> > Kind Regards,
> > Anshuman Ghosh
> > Contact - +49 179 9090964
> >
> > On Wed, Mar 22, 2017 at 6:27 PM, Matt Burgess <[email protected]> wrote:
> >>
> >> Anshuman,
> >>
> >> According to [1], it looks like CDH 5.10 also uses an Apache Hive
> >> 1.1.0 baseline, and looking through the changes [2] I didn't see
> >> anything related to the client_protocol field being added. You are
> >> right that ExecuteProcess should also work with a beeline command; the
> >> major difference is that ExecuteProcess does not accept an incoming
> >> flow file and ExecuteStreamCommand does.
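The "repair" the thread keeps referring to is the statement that registers newly created HDFS directories as partitions of the external table. A minimal sketch, assuming a hypothetical table name; the actual beeline call is left as a comment because it needs a live HiveServer2:

```shell
# Sketch of the repair step to run from ExecuteStreamCommand (ignoring
# STDIN) when a new partition is detected. "my_external_table" and the
# JDBC URL are placeholders, not names from the thread.
TABLE="my_external_table"
REPAIR_SQL="MSCK REPAIR TABLE ${TABLE};"
echo "$REPAIR_SQL"
# In the real flow this string would be handed to beeline, e.g.:
# beeline -u "jdbc:hive2://hive-host:10000/default" -e "$REPAIR_SQL"
```

Because the repair is idempotent, routing only new-partition flow files to it keeps metastore load low without risking missed partitions.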
> >> One thing I should mention: if your Hive query/statement is going
> >> to generate a lot of output (due to a long-running MapReduce job,
> >> for example), you may want to use the --silent command line option
> >> to suppress the output. Otherwise the ExecuteProcess and/or
> >> ExecuteStreamCommand processors have been known to hang on large
> >> outputs.
> >>
> >> Regards,
> >> Matt
> >>
> >> [1] https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_510.html
> >> [2] http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.10.0.CHANGES.txt?_ga=1.60219309.1838615776.1489495012
> >>
> >> On Wed, Mar 22, 2017 at 12:42 PM, Anshuman Ghosh
> >> <[email protected]> wrote:
> >> > Hello Matt,
> >> >
> >> > Thank you very much for your reply!
> >> > I guess "ExecuteProcess" should also work with a beeline command?
> >> > However, do you know whether CDH 5.10 has a higher Hive version or
> >> > not?
> >> >
> >> > Thanking you in advance!
> >> >
> >> > ______________________
> >> >
> >> > *Kind Regards,*
> >> > *Anshuman Ghosh*
> >> > *Contact - +49 179 9090964*
> >> >
> >> > On Wed, Mar 22, 2017 at 4:43 PM, Matt Burgess <[email protected]>
> >> > wrote:
> >> >
> >> >> Anshuman,
> >> >>
> >> >> The Hive processors use Apache Hive 1.2.0, which is not compatible
> >> >> with Hive 1.1.0 and is thus a known issue against clusters that use
> >> >> Hive 1.1.0 such as CDH 5.9. Unfortunately there were API/code
> >> >> changes between Hive 1.1.0 and Hive 1.2.0, which means there is no
> >> >> simple workaround with respect to the Hive processors. The Hive NAR
> >> >> would have to be rebuilt (and its code changed) to use Hive 1.1.0.
> >> >>
> >> >> One possible workaround is to use ExecuteStreamCommand and the
> >> >> command-line hive client (hive, beeline, etc.) to execute HiveQL
> >> >> statements. This is not ideal but should work for getting the
> >> >> statements executed.
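The command-line workaround Matt describes, with output suppressed as he recommends so the processor does not hang on large results, could look roughly like this. Host, port, and table name are placeholders; the `echo` calls only print the command lines that would run, since no Hive cluster is assumed here.

```shell
# Sketch of silent CLI invocations for ExecuteStreamCommand/ExecuteProcess.
STMT="MSCK REPAIR TABLE my_external_table;"

# Via the legacy hive CLI (-S is its silent flag):
echo hive -S -e "$STMT"

# Via beeline (--silent=true suppresses progress/log output):
echo beeline -u "jdbc:hive2://hive-host:10000/default" --silent=true -e "$STMT"
```

Suppressing output matters here because both processors buffer the command's stdout; a chatty long-running MapReduce job can produce enough output to stall the flow.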
> >> >>
> >> >> Regards,
> >> >> Matt
> >> >>
> >> >> On Wed, Mar 22, 2017 at 11:34 AM, Anshuman Ghosh
> >> >> <[email protected]> wrote:
> >> >> > Hello everyone,
> >> >> >
> >> >> > I am trying to use this "PutHiveQL" processor.
> >> >> > However, no luck with the connection string; it seems like I am
> >> >> > missing out on something.
> >> >> >
> >> >> > I am getting an error like "Required field 'client_protocol' is
> >> >> > unset!"
> >> >> > Please find the attachments for the error message and also the
> >> >> > config property.
> >> >> >
> >> >> > BTW, I am using Hive 1.1.0 which is packaged with CDH 5.9. Can
> >> >> > that be a reason?
> >> >> > What would be the workaround?
> >> >> >
> >> >> > Thanking you in advance!
> >> >> >
> >> >> > ______________________
> >> >> >
> >> >> > Kind Regards,
> >> >> > Anshuman Ghosh
> >> >> > Contact - +49 179 9090964
