On Wed, Sep 2, 2009 at 8:27 PM, Schubert Zhang <[email protected]> wrote:

>
> I would like advice at the concept and theory level, drawing on your
> expert experience.
>
> Let me share some of my application requirements to make the question
> clear:
>
> - My dataset:
>  (1) Structured data, about 200-500 bytes (<1KB) per row.
>  (2) Time-series, but we cannot ensure that incoming data always arrives in
> timestamp order. A later input file may contain rows whose timestamps are
> earlier than those in the preceding file.
>  (3) The data set is very big and arrives fast and continuously (e.g. 4
> billion rows/2TB daily).
>
> - Processing
>  (1) Use MapReduce to ingest raw input data files periodically. But we
> cannot ensure that the data in the current run is always later than the
> data in the preceding run.
>  (2) We want to run analytical queries (based on MapReduce) on time ranges
> of this data.
>  (3) We also want to query (randomly) a small set of rows from the
> dataset with low latency (e.g. < or ~ 1 second). So we must use indexing
> (primary and/or secondary).
>
>
How many machines can you use for this job?

Do you need to keep it all?  Does some data expire (or can it be moved
offline)?

I see why you have timestamp as part of the key in your current hbase
cluster -- i.e. tall tables -- as you have no other choice currently.

It might make sense to premake the regions in the table.  Look at how many
regions were made the day before and go ahead and premake them to save
yourself having to ride over splits (I can show you how to write a little
script to do this).
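
A rough sketch of the premaking idea, in Python and deliberately free of any
HBase client calls: it only computes the split keys. The row-key layout (keys
starting with a `YYYYMMDDhhmmss` timestamp), the function name
`split_keys_for_day` and the even spacing over the day are assumptions for
illustration, not something taken from your schema. If the write rate is not
uniform over the day, yesterday's actual region boundaries would be a better
starting point than evenly spaced ones.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 24 * 60 * 60


def split_keys_for_day(day, n_regions):
    """Return n_regions - 1 evenly spaced split keys covering one day.

    Assumes (hypothetically) that row keys begin with a YYYYMMDDhhmmss
    timestamp, so plain byte ordering matches time ordering.
    """
    if n_regions < 1:
        raise ValueError("n_regions must be at least 1")
    midnight = datetime(day.year, day.month, day.day)
    keys = []
    # n regions need n - 1 boundaries; the first and last regions are open-ended.
    for i in range(1, n_regions):
        offset = timedelta(seconds=i * SECONDS_PER_DAY // n_regions)
        keys.append((midnight + offset).strftime("%Y%m%d%H%M%S").encode("ascii"))
    return keys
```

For example, if yesterday ended with 4 regions, `split_keys_for_day(datetime(2009, 9, 3), 4)`
gives three boundaries at 06:00, 12:00 and 18:00, which could then be handed to
whatever table-creation mechanism your HBase version offers.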

Does the time-series data arrive roughly on time -- e.g. all instruments
emit the 4 o'clock readings at 4 o'clock or is there some flux in here?  In
other words, do you have a write rate of thousands of updates per second,
all carrying the same timestamp?
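
If the answer is yes, all of those writes sort next to each other and land on
one region at a time. One common workaround (an illustration only, not
something proposed earlier in this thread) is to prefix the key with a small
bucket number derived from the rest of the key, so that rows with the same
timestamp spread over several regions. The price is that a time-range scan has
to be run once per bucket. The bucket count, the `source_id` field and the key
layout below are all assumptions.

```python
import zlib

NUM_BUCKETS = 16  # assumed value; something near the number of region servers


def salted_key(timestamp, source_id, num_buckets=NUM_BUCKETS):
    """Build b'<bucket>-<timestamp>-<source_id>' for one row.

    The bucket comes from a CRC32 of source_id, so it is stable across runs.
    Two-digit buckets keep byte ordering correct for up to 100 buckets.
    """
    if not 1 <= num_buckets <= 100:
        raise ValueError("num_buckets must be between 1 and 100")
    bucket = zlib.crc32(source_id.encode("utf-8")) % num_buckets
    return f"{bucket:02d}-{timestamp}-{source_id}".encode("utf-8")


def scan_ranges(start_ts, stop_ts, num_buckets=NUM_BUCKETS):
    """Return one (start_key, stop_key) pair per bucket for a time range."""
    return [
        (f"{b:02d}-{start_ts}".encode("utf-8"), f"{b:02d}-{stop_ts}".encode("utf-8"))
        for b in range(num_buckets)
    ]
```

Each pair from `scan_ranges` would serve as the start/stop keys of one scan,
and the per-bucket results are merged by the caller.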

St.Ack




> Schubert
>
> On Thu, Sep 3, 2009 at 2:32 AM, Jonathan Gray <[email protected]> wrote:
>
> > @Sylvain
> >
> > If you describe your use case, perhaps we can help you to understand what
> > others are doing / have done similarly.  Event logging is certainly
> > something many of us have done.
> >
> > If you're wondering about how much load HBase can handle, provide some
> > numbers of what you expect.  How much data in bytes is associated with
> > each event, how many events per hour, and what operations do you want to
> > do on it?  We could help you determine how big of a cluster you might need
> > and the kind of write/read throughput you might see.
> >
> > @Schubert
> >
> > You do not need to partition your tables by stamp.  One possibility is to
> > put the stamp as the first part of your rowkeys, and in that way you will
> > have the table sorted by time.  Using Scan's start/stop keys, you can
> > prevent doing a full table scan.
> >
> It would not work, since our data arrives quickly. With that method only one
> region(server) is busy with writes at a time, so write throughput is bad.
>
>
> >
> > For both of you... If you are storing massive amounts of streaming
> > log-type data, do you need full random read access to it?  If you just
> > need to process on subsets of time, that's easily partitioned by file.
> > HBase should be used if you need to *read* from it randomly, not
> > streaming.  If you have processing that can benefit from HBase's inherent
> > sorting, grouping, and indexing, then it also can make sense to use HBase
> > in order to avoid full-scans of data.
> >
>
> I know there is a tension between random access and batch processing, but
> the features of HBase (sorting, distributed b-tree, merge/compaction) are
> very attractive.
>
>
> >
> > HBase is not the answer because of lack of HDFS append.  You could buffer
> > in something outside HDFS, close files after a certain size/time (this is
> > what hbase does now, we can have data loss because of no
> > appends as well), etc...
> >
> > Reads/writes of lots of streaming data to HBase will always be slower
> > than HDFS.  HBase adds additional buffering, and the compaction/split
> > processes actually mean you copy the same data multiple times (probably
> > 3-4 times avg which lines up with the 3-4x slowdown you see).
> >
> >
> > And there is currently a patch in development (that works at least
> > partially) to do direct-to-hdfs imports to HBase which would then be
> > nearly as fast as a normal HDFS writing job.
> >
> > Issue here:  https://issues.apache.org/jira/browse/HBASE-48
> >
> >
> > JG
> >
> >
> > Sylvain Hellegouarch wrote:
> >
> >> I must admit, I'm left as puzzled as you are. Our current use case at
> >> work involves writing a large number of small event logs. Of course HDFS
> >> was quickly out of the question, since it cannot yet append to a file or,
> >> more generally, handle a large number of small write ops.
> >>
> >> So we went with HBase because we trust the Hadoop/HBase infrastructure
> >> to offer us the robustness and reliability we need. That being said, I'm
> >> not at ease regarding HBase's capacity to handle the potential load we
> >> are looking at inputting.
> >>
> >> In fact, it's a common trait of such systems: they've been designed with
> >> a certain use case in mind, and sometimes I feel like their design and
> >> implementation leak way too much into our infrastructure, leading us
> >> down the path of a virtual lock-in.
> >>
> >> Now I am not accusing anyone here, just observing that I find it really
> >> hard to locate any industrial story of those systems in a use case
> >> similar to the one we have at hand.
> >>
> >> The number of nodes this or that company has doesn't interest me as much
> >> as the way they are actually using HBase and Hadoop.
> >>
> >> RDBMSs don't scale as well, but they've got a long history and people do
> >> know how to optimise, use and manage them. It seems column-oriented
> >> database systems are still young :)
> >>
> >> - Sylvain
> >>
> >> Schubert Zhang wrote:
> >>
> >>> Regardless of Cassandra, I want to discuss some questions about
> >>> HBase/Bigtable.  Any advice is welcome.
> >>>
> >>> Regarding running MapReduce to scan/analyze big data in HBase:
> >>>
> >>> Compared to sequentially reading data from HDFS files directly,
> >>> scanning/sequentially reading data from HBase is slower (in my tests,
> >>> by at least 3:1 or 4:1).
> >>>
> >>> For the data in HBase, it is difficult to analyze only a specified part
> >>> of the data. For example, it is difficult to analyze only the most
> >>> recent day of data. In my application, I am considering partitioning
> >>> data into different HBase tables (e.g. one table per day), so that I
> >>> only need to touch one table when analyzing via MapReduce.
> >>> In Google's Bigtable paper, in section "8.1 Google Analytics", they also
> >>> describe this usage, but I don't know how they do it.
> >>>
> >>> It is also slower to put a flood of data into an HBase table than to
> >>> write it to files (in my tests, by at least 3:1 or 4:1 too). So, maybe
> >>> in the future, HBase can provide a bulk-load feature, like PNUTS?
> >>>
> >>> Many people suggest that we store only metadata in HBase tables and
> >>> leave the data in HDFS files, because our time-series dataset is very
> >>> big.  I understand this idea makes sense for some simple application
> >>> requirements. But usually, I want different indexes on the raw data. It
> >>> is difficult to build such indexes if the raw data files (which are raw
> >>> or are reconstructed via MapReduce periodically on recent data) are not
> >>> totally sorted.  .... HBase can provide us many of the expected
> >>> features: sorted, distributed b-tree, compact/merge.
> >>>
> >>> So, it is very difficult for me to make the trade-off.
> >>> If I store data in HDFS files (maybe partitioned) and the
> >>> metadata/index in HBase, the metadata/index is very difficult to build.
> >>> If I rely on HBase totally, the performance of ingesting and scanning
> >>> data is not good. Is it reasonable to do MapReduce on HBase? We know
> >>> the goal of HBase is to provide random access over HDFS, and it is an
> >>> extension or adaptor over HDFS.
> >>>
> >>> ----
> >>> Many a time, I think that maybe we need a data storage engine which
> >>> does not need such strong consistency and can provide better write and
> >>> read throughput, like HDFS. Maybe we can design another system, like a
> >>> simpler HBase?
> >>>
> >>> Schubert
> >>>
> >>> On Wed, Sep 2, 2009 at 8:56 AM, Andrew Purtell <[email protected]>
> >>> wrote:
> >>>
> >>>
> >>>
> >>>> To be precise, S3. http://status.aws.amazon.com/s3-20080720.html
> >>>>
> >>>>  - Andy
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: Andrew Purtell <[email protected]>
> >>>> To: [email protected]
> >>>> Sent: Tuesday, September 1, 2009 5:53:09 PM
> >>>> Subject: Re: Cassandra vs HBase
> >>>>
> >>>>
> >>>> Right... I recall an incident in AWS where a malformed gossip packet
> >>>> took down all of Dynamo. Seems that even P2P doesn't mitigate against
> >>>> corner cases.
> >>>>
> >>>>
> >>>> On Tue, Sep 1, 2009 at 3:12 PM, Jonathan Ellis <[email protected]>
> >>>> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> The big win for Cassandra is that its p2p distribution model -- which
> >>>>> drives the consistency model -- means there is no single point of
> >>>>> failure.  SPF can be mitigated by failover but it's really, really
> >>>>> hard to get all the corner cases right with that approach.  Even
> >>>>> Google with their 3 year head start and huge engineering resources
> >>>>> still has trouble with that occasionally.  (See e.g.
> >>>>> http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179 .)
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
>
