You used HDFS too? Or did you store everything directly on the SAN? I don't have an exact figure in GB/TB (it might be about 2TB, so not really that "huge"), but there are more than 100 million documents to be processed. On a single machine we can currently process about 200,000 docs/day (several parsing, indexing, and metadata-extraction steps have to be done). So in the worst case we want to use the 50 VMs to distribute the processing.
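For a rough sense of scale, using only the numbers above and assuming throughput scales linearly across VMs (which I/O contention on the shared storage may well prevent):

```java
public class ThroughputEstimate {
    public static void main(String[] args) {
        long totalDocs = 100_000_000L;        // >100 million documents to process
        long docsPerMachinePerDay = 200_000L; // observed single-machine rate
        int vms = 50;

        long machineDays = totalDocs / docsPerMachinePerDay;
        long wallClockDays = machineDays / vms;

        System.out.println(machineDays + " machine-days total, ~"
                + wallClockDays + " days across " + vms
                + " VMs (ignoring I/O contention)");
    }
}
```

So even under the optimistic linear-scaling assumption, the job is about 500 machine-days, or roughly 10 days of wall-clock time on 50 VMs.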
2012/5/17 Sagar Shukla <sagar_shu...@persistent.co.in>

> Hi PA,
>      In my environment, we had SAN storage and I/O was pretty good. So if
> you have a similar environment then I don't see any performance issues.
>
> Just out of curiosity - what amount of data are you looking forward to
> process?
>
> Regards,
> Sagar
>
> -----Original Message-----
> From: Pierre Antoine Du Bois De Naurois [mailto:pad...@gmail.com]
> Sent: Thursday, May 17, 2012 8:29 PM
> To: common-user@hadoop.apache.org
> Subject: Re: is hadoop suitable for us?
>
> Thanks Sagar, Mathias and Michael for your replies.
>
> It seems we will have to go with Hadoop even if I/O will be slow due to
> our configuration.
>
> I will try to update on how it worked for our case.
>
> Best,
> PA
>
>
> 2012/5/17 Michael Segel <michael_se...@hotmail.com>
>
> > The short answer is yes.
> > The longer answer is that you will have to account for the latencies.
> >
> > There is more but you get the idea..
> >
> > Sent from my iPhone
> >
> > On May 17, 2012, at 5:33 PM, "Pierre Antoine Du Bois De Naurois" <
> > pad...@gmail.com> wrote:
> >
> > > We have a large amount of text files that we want to process and
> > > index (plus applying other algorithms).
> > >
> > > The problem is that our configuration is share-everything, while
> > > Hadoop assumes a share-nothing configuration.
> > >
> > > We have 50 VMs rather than actual servers, and they share a huge
> > > central storage. So using HDFS might not be really useful:
> > > replication will not help, and distributing the files has no meaning
> > > when all of them will again be located on the same HDD. I am afraid
> > > that I/O will be very slow with or without HDFS. So I am wondering
> > > whether it will really help us to use Hadoop/HBase/Pig etc. to
> > > distribute and run several parallel tasks, or whether it is "better"
> > > to install something different (though I am not sure what). We heard
> > > myHadoop is better for this kind of configuration; do you have any
> > > experience with it?
> > >
> > > For example, we now have a central MySQL to check whether we have
> > > already processed a document, and we keep several metadata there.
> > > Soon we will have to distribute it, as there is not enough space in
> > > one VM. But will Hadoop/HBase be useful? We don't want to do any
> > > complex joins/sorts of the data; we just want to query whether a
> > > document has already been processed, and if not, add it along with
> > > several of its metadata.
> > >
> > > We heard Sun Grid, for example, is another way to go, but it's
> > > commercial. We are somewhat lost, so any help/ideas/suggestions are
> > > appreciated.
> > >
> > > Best,
> > > PA
> > >
> > >
> > > 2012/5/17 Abhishek Pratap Singh <manu.i...@gmail.com>
> > >
> > >> Hi,
> > >>
> > >> For your question whether Hadoop can be used without HDFS, the
> > >> answer is yes. Hadoop can be used with any kind of distributed file
> > >> system. But I am not able to understand the problem statement
> > >> clearly enough to give my point of view. Are you processing text
> > >> files and saving them in a distributed database?
> > >>
> > >> Regards,
> > >> Abhishek
> > >>
> > >> On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois <
> > >> pad...@gmail.com> wrote:
> > >>
> > >>> We want to distribute the processing of text files: large machine
> > >>> learning tasks, a distributed database, as we have a big amount of
> > >>> data, etc.
> > >>>
> > >>> The problem is that each VM can hold at most 2TB of data (a
> > >>> limitation of the VM), and we have 20TB of data. So we have to
> > >>> distribute the processing, the database, etc. But all that data
> > >>> will live in one huge shared central file system.
> > >>>
> > >>> We heard about myHadoop, but we are not sure how it differs from
> > >>> Hadoop.
> > >>>
> > >>> Can we run Hadoop/MapReduce without using HDFS? Is that an option?
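On the "Hadoop without HDFS" question: for MapReduce this is mostly a configuration matter. The default filesystem can be pointed at `file://` so that job input and output paths resolve against the shared POSIX mount instead of HDFS. A sketch of the relevant `core-site.xml` property for Hadoop 1.x (the mount path is hypothetical, and this trades away data locality and replication, which the shared SAN provides anyway):

```xml
<!-- core-site.xml: use the local/shared POSIX filesystem instead of HDFS.
     Job paths like /mnt/shared/input then resolve on the shared mount;
     /mnt/shared is a made-up example path. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
```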
> > >>>
> > >>> best,
> > >>> PA
> > >>>
> > >>>
> > >>> 2012/5/17 Mathias Herberts <mathias.herbe...@gmail.com>
> > >>>
> > >>>> Hadoop does not perform well with shared storage and VMs.
> > >>>>
> > >>>> The question should be asked first regarding what you're trying
> > >>>> to achieve, not about your infra.
> > >>>> On May 17, 2012 10:39 PM, "Pierre Antoine Du Bois De Naurois" <
> > >>>> pad...@gmail.com> wrote:
> > >>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> We have about 50 VMs and we want to distribute processing across
> > >>>>> them. However, these VMs share a huge data storage system, and
> > >>>>> thus their "virtual" HDDs are all located in the same computer.
> > >>>>> Would Hadoop be useful for such a configuration? Could we use
> > >>>>> Hadoop without HDFS, so that we can retrieve and store everything
> > >>>>> in the same storage?
> > >>>>>
> > >>>>> Thanks,
> > >>>>> PA
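On the "check if already processed, else record its metadata" question raised in the thread: with many workers, that operation needs to be an atomic check-and-insert, which HBase exposes as checkAndPut on a row key (e.g. the document ID). A minimal sketch of the semantics, using an in-memory ConcurrentHashMap as a stand-in for the real store (class and method names here are illustrative, not from any message above):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProcessedRegistry {
    // docId -> metadata; in production this would be an HBase table
    // keyed by document ID, not an in-memory map.
    private final ConcurrentHashMap<String, Map<String, String>> store =
            new ConcurrentHashMap<>();

    /**
     * Atomically record a document as processed.
     * Returns true if this call claimed the document (it was unseen),
     * false if another worker had already recorded it.
     */
    public boolean markProcessed(String docId, Map<String, String> metadata) {
        // putIfAbsent mirrors the checkAndPut pattern: insert only if
        // no value exists yet for this key, as a single atomic step.
        return store.putIfAbsent(docId, metadata) == null;
    }

    public boolean isProcessed(String docId) {
        return store.containsKey(docId);
    }
}
```

The atomicity is the point: with 50 VMs pulling from the same pool, two workers that pick up the same document simultaneously cannot both claim it, so no document is processed twice.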