Thanks Sagar, Mathias and Michael for your replies.

It seems we will have to go with Hadoop even though I/O will be slow due to
our configuration.

I will try to post an update on how it works out in our case.

Best,
PA



2012/5/17 Michael Segel <michael_se...@hotmail.com>

> The short answer is yes.
> The longer answer is that you will have to account for the latencies.
>
> There is more to it, but you get the idea.
>
> Sent from my iPhone
>
> On May 17, 2012, at 5:33 PM, "Pierre Antoine Du Bois De Naurois" <
> pad...@gmail.com> wrote:
>
> > We have a large number of text files that we want to process and index
> > (plus apply other algorithms to them).
> >
> > The problem is that our configuration is shared-everything, while Hadoop
> > assumes a shared-nothing configuration.
> >
> > We have 50 VMs rather than actual servers, and they all share one huge
> > central storage system. So using HDFS might not be very useful: replication
> > will not help, and distributing the files is meaningless, since all the
> > files will end up on the same disks anyway. I am afraid that I/O will be
> > very slow with or without HDFS. So I am wondering whether it will really
> > help us to use Hadoop/HBase/Pig etc. to distribute the work and run several
> > tasks in parallel, or whether it is "better" to install something different
> > (though I am not sure what). We heard myHadoop is better suited to this
> > kind of configuration; does anyone know anything about it?
> >
> > For example, we currently have a central MySQL database that we use to
> > check whether we have already processed a document, and we keep several
> > pieces of metadata there. Soon we will have to distribute it, as there is
> > not enough space in one VM. Would Hadoop/HBase be useful for this? We don't
> > want to do any complex joins or sorts on the data; we just want to query
> > whether a document has already been processed and, if not, add it together
> > with several pieces of its metadata.
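> >
> > As a rough sketch of what we have in mind (using HBase's plain Java client;
> > the table, family, qualifier and value names below are just placeholders),
> > the "check and add if missing" step could look something like this:
> >
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.hbase.HBaseConfiguration;
> >   import org.apache.hadoop.hbase.client.HTable;
> >   import org.apache.hadoop.hbase.client.Put;
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   public class MarkProcessed {
> >     public static void main(String[] args) throws Exception {
> >       Configuration conf = HBaseConfiguration.create();
> >       HTable table = new HTable(conf, "documents");    // placeholder table name
> >
> >       byte[] docId = Bytes.toBytes("doc-00042");       // placeholder document id
> >       Put put = new Put(docId);
> >       put.add(Bytes.toBytes("meta"), Bytes.toBytes("source"),
> >               Bytes.toBytes("batch-1"));               // placeholder metadata
> >
> >       // checkAndPut writes only if the cell is still absent; the check and
> >       // the write are atomic on the region server, so two workers racing
> >       // on the same document cannot both "win".
> >       boolean added = table.checkAndPut(docId, Bytes.toBytes("meta"),
> >           Bytes.toBytes("source"), null, put);
> >       System.out.println(added ? "newly added" : "already processed");
> >       table.close();
> >     }
> >   }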
> >
> > We heard that Sun Grid, for example, is another way to go, but it is
> > commercial. We are somewhat lost, so any help/ideas/suggestions are
> > appreciated.
> >
> > Best,
> > PA
> >
> >
> >
> > 2012/5/17 Abhishek Pratap Singh <manu.i...@gmail.com>
> >
> >> Hi,
> >>
> >> As for your question of whether Hadoop can be used without HDFS, the
> >> answer is yes. Hadoop can be used with any kind of distributed file system.
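> >>
> >> As a minimal sketch (the shared mount point below is just a placeholder),
> >> pointing Hadoop's default filesystem at a shared POSIX mount instead of
> >> HDFS is a one-line configuration change:
> >>
> >>   import org.apache.hadoop.conf.Configuration;
> >>   import org.apache.hadoop.fs.FileStatus;
> >>   import org.apache.hadoop.fs.FileSystem;
> >>   import org.apache.hadoop.fs.Path;
> >>
> >>   public class SharedFsCheck {
> >>     public static void main(String[] args) throws Exception {
> >>       Configuration conf = new Configuration();
> >>       // Use the shared mount directly instead of HDFS
> >>       // ("fs.default.name" is the 1.x name of this setting).
> >>       conf.set("fs.default.name", "file:///");
> >>
> >>       FileSystem fs = FileSystem.get(conf);  // resolves to the local FS
> >>       // List an input directory on the shared volume (placeholder path).
> >>       for (FileStatus s : fs.listStatus(new Path("/mnt/shared/input"))) {
> >>         System.out.println(s.getPath());
> >>       }
> >>     }
> >>   }
> >>
> >> The same fs.default.name setting in core-site.xml should make MapReduce
> >> jobs read from and write to that mount instead of HDFS.
> >>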
> >> But I am not able to understand the problem statement clearly enough to
> >> give my point of view. Are you processing text files and saving them in a
> >> distributed database?
> >>
> >> Regards,
> >> Abhishek
> >>
> >> On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois <
> >> pad...@gmail.com> wrote:
> >>
> >>> We want to distribute the processing of text files, run large machine
> >>> learning tasks, and have a distributed database, since we have a big
> >>> amount of data, etc.
> >>>
> >>> The problem is that each VM can hold up to 2 TB of data (a limitation of
> >>> the VMs), and we have 20 TB of data in total. So we have to distribute the
> >>> processing, the database, etc. But all of that data will live on one huge
> >>> shared central file system.
> >>>
> >>> We heard about myHadoop, but we are not sure how it differs from Hadoop.
> >>>
> >>> Could we run Hadoop/MapReduce without using HDFS? Is that an option?
> >>>
> >>> best,
> >>> PA
> >>>
> >>>
> >>> 2012/5/17 Mathias Herberts <mathias.herbe...@gmail.com>
> >>>
> >>>> Hadoop does not perform well with shared storage and VMs.
> >>>>
> >>>> The question should first be about what you are trying to achieve, not
> >>>> about your infrastructure.
> >>>> On May 17, 2012 10:39 PM, "Pierre Antoine Du Bois De Naurois" <
> >>>> pad...@gmail.com> wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> We have about 50 VMs and we want to distribute processing across them.
> >>>>> However, these VMs share a huge data storage system, so their "virtual"
> >>>>> HDDs are all located on the same machine. Would Hadoop be useful for such
> >>>>> a configuration? Could we use Hadoop without HDFS, so that we can retrieve
> >>>>> and store everything on the same storage?
> >>>>>
> >>>>> Thanks,
> >>>>> PA
> >>>>>
> >>>>
> >>>
> >>
>
