Thanks Sagar, Mathias and Michael for your replies. It seems we will have to go with Hadoop even though I/O will be slow due to our shared-storage configuration.
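In case it helps anyone else in a similar setup: this is roughly how we plan to point a MapReduce job at the shared mount instead of HDFS. It is only a sketch of what we have in mind, not a tested setup; the job name, the missing mapper/reducer wiring and the /shared/... paths are placeholders, and we are still on the Hadoop 1.x property names.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedStorageJob {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Point Hadoop at the shared mount instead of HDFS.
    // "fs.default.name" is the Hadoop 1.x property name ("fs.defaultFS" in
    // newer releases); file:/// makes every task read and write the mounted
    // (shared) filesystem directly, bypassing HDFS entirely.
    conf.set("fs.default.name", "file:///");

    Job job = new Job(conf, "index-text-files");
    job.setJarByClass(SharedStorageJob.class);
    // Mapper/Reducer classes omitted here; this only shows the filesystem wiring.

    // Input and output sit on the shared storage, visible to all 50 VMs
    // under the same (placeholder) paths.
    FileInputFormat.addInputPath(job, new Path("file:///shared/corpus"));
    FileOutputFormat.setOutputPath(job, new Path("file:///shared/index-out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The idea is to keep the JobTracker/TaskTrackers on the VMs so the CPU-bound work is still spread out, while all input and output stay on the central storage.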
I will try to post an update on how it works out for our case.

Best,
PA

2012/5/17 Michael Segel <michael_se...@hotmail.com>

> The short answer is yes.
> The longer answer is that you will have to account for the latencies.
>
> There is more, but you get the idea.
>
> Sent from my iPhone
>
> On May 17, 2012, at 5:33 PM, "Pierre Antoine Du Bois De Naurois" <
> pad...@gmail.com> wrote:
>
> > We have a large amount of text files that we want to process and index
> > (plus apply other algorithms).
> >
> > The problem is that our configuration is share-everything, while Hadoop
> > has a share-nothing architecture.
> >
> > We have 50 VMs rather than actual servers, and they all share one huge
> > central storage. So using HDFS might not be very useful: replication will
> > not help, and distributing files has no meaning, as all files will still
> > be located on the same HDD. I am afraid that I/O will be very slow with
> > or without HDFS. So I am wondering whether it will really help us to use
> > Hadoop/HBase/Pig etc. to distribute the work and run several parallel
> > tasks, or whether it is "better" to install something different (though
> > I am not sure what). We heard myHadoop is better for this kind of
> > configuration; do you have any experience with it?
> >
> > For example, we now have a central MySQL database to check whether we
> > have already processed a document, and we keep several pieces of metadata
> > there. Soon we will have to distribute it, as there is not enough space
> > in one VM. Would Hadoop/HBase be useful for that? We don't want to do any
> > complex joins or sorts on the data; we just want to query whether a
> > document has already been processed and, if not, add it along with
> > several of its metadata fields.
> >
> > We heard Sun Grid, for example, is another way to go, but it is
> > commercial. We are somewhat lost, so any help/ideas/suggestions are
> > appreciated.
> >
> > Best,
> > PA
> >
> >
> > 2012/5/17 Abhishek Pratap Singh <manu.i...@gmail.com>
> >
> >> Hi,
> >>
> >> For your question whether Hadoop can be used without HDFS, the answer
> >> is yes. Hadoop can be used with any kind of distributed file system.
> >> But I am not able to understand the problem statement clearly enough to
> >> give my point of view.
> >> Are you processing text files and saving them in a distributed database?
> >>
> >> Regards,
> >> Abhishek
> >>
> >> On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois <
> >> pad...@gmail.com> wrote:
> >>
> >>> We want to distribute the processing of text files and large machine
> >>> learning tasks, and to have a distributed database, as we have a big
> >>> amount of data.
> >>>
> >>> The problem is that each VM can hold up to 2 TB of data (a limitation
> >>> of the VM), and we have 20 TB of data. So we have to distribute the
> >>> processing, the database, etc. But all of that data will live in one
> >>> shared, huge central file system.
> >>>
> >>> We heard about myHadoop, but we are not sure how it differs from Hadoop.
> >>>
> >>> Could we run Hadoop/MapReduce without using HDFS? Is that an option?
> >>>
> >>> Best,
> >>> PA
> >>>
> >>>
> >>> 2012/5/17 Mathias Herberts <mathias.herbe...@gmail.com>
> >>>
> >>>> Hadoop does not perform well with shared storage and VMs.
> >>>>
> >>>> The question should be asked first regarding what you're trying to
> >>>> achieve, not about your infra.
> >>>> On May 17, 2012 10:39 PM, "Pierre Antoine Du Bois De Naurois" <
> >>>> pad...@gmail.com> wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> We have about 50 VMs and we want to distribute processing across
> >>>>> them. However, these VMs share a huge data storage system, and thus
> >>>>> their "virtual" HDDs are all located on the same machine. Would
> >>>>> Hadoop be useful for such a configuration? Could we use Hadoop
> >>>>> without HDFS, so that we can retrieve and store everything on the
> >>>>> same storage?
> >>>>>
> >>>>> Thanks,
> >>>>> PA
> >>>>
> >>>
> >>
> >
>
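P.S. For the "have we already processed this document?" lookup I mentioned earlier in the thread: if we move it from MySQL to HBase, we are thinking of something along these lines. Again only a sketch under our own assumptions; the "documents" table, the "meta" column family and the document id are made-up names, and it uses the old HTable client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ProcessedCheck {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "documents" table and "meta" column family are placeholders for our schema.
    HTable table = new HTable(conf, "documents");
    String docId = "doc-00042"; // placeholder document id, used as the row key

    // A plain existence check on the row key: no joins or sorts needed.
    boolean alreadyProcessed = table.exists(new Get(Bytes.toBytes(docId)));

    if (!alreadyProcessed) {
      // ... process the document here, then record it with its metadata.
      Put put = new Put(Bytes.toBytes(docId));
      put.add(Bytes.toBytes("meta"), Bytes.toBytes("path"),
              Bytes.toBytes("/shared/corpus/doc-00042.txt"));
      table.put(put);
    }

    table.close();
  }
}

Since the row key is the document id, this is a pure key lookup, which is really all we need.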