You could run the Flume collectors on other machines and write a source that connects to the sockets on the data generators.
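
A minimal sketch of the socket-pulling part of that idea, assuming each data generator streams raw bytes over a plain TCP socket and closes the connection when it is done. This is not Flume's source API; `pull_from_socket` and its parameters are hypothetical names used only for illustration, and `sink` stands in for whatever the collector writes to (a local file here, an HDFS output stream in a real setup).

```python
import socket
from typing import BinaryIO


def pull_from_socket(host: str, port: int, sink: BinaryIO,
                     chunk_size: int = 64 * 1024) -> int:
    """Connect to a data generator and copy everything it sends into `sink`.

    Runs on the collector machine, so nothing is installed on the generator.
    Returns the number of bytes copied.
    """
    total = 0
    with socket.create_connection((host, port)) as conn:
        while True:
            chunk = conn.recv(chunk_size)
            if not chunk:  # an empty read means the generator closed the connection
                break
            sink.write(chunk)
            total += len(chunk)
    return total
```

One such puller per source address and port, each in its own process or thread, would give parallel ingestion without touching the generators themselves.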

-Joey

On Dec 15, 2011, at 21:27, "Periya.Data" <[email protected]> wrote:

> Sorry...misworded my statement. What I meant was that the sources are meant
> to be untouched and admins do not want to mess with it and add more tools in
> there. All I've got is source addresses, port numbers. Once I know what
> technique(s) I will be using, accordingly, I will be given access via
> firewalls and other access credentials.
>
> -PD
>
> On Thu, Dec 15, 2011 at 5:05 PM, Russell Jurney <[email protected]>
> wrote:
>
> Just curious - what is the situation you're in where no collectors are
> possible? Sounds interesting.
>
> Russell Jurney
> twitter.com/rjurney
> [email protected]
> datasyndrome.com
>
> On Dec 15, 2011, at 5:01 PM, "Periya.Data" <[email protected]> wrote:
>
> > Hi all,
> > I would like to know what options I have to ingest terabytes of data
> > that are being generated very fast from a small set of sources. I have
> > thought about :
> >
> > 1. Flume
> > 2. Have an intermediate staging server(s) where you can offload data and
> > from there use dfs -put to load into HDFS.
> > 3. Anything else??
> >
> > Suppose I am unable to use Flume (since the sources do not support their
> > installation) and suppose that I do not have the luxury of having an
> > intermediate staging place, what options do I have? In this case, I might
> > have to directly (preferably in parallel) ingest data into HDFS.
> >
> > I have read about a technique to use Map-Reduce where the map would read
> > data and use JAVA API to store in HDFS. We could have multiple threads of
> > maps to get parallel ingestion. It would be nice to know about ways to
> > ingest data "directly" into HDFS considering my assumptions.
> >
> > Suggestions are appreciated,
> >
> > /PD.
>
