On Tue, Mar 27, 2012 at 5:22 PM, Franc Carter <franc.car...@sirca.org.au> wrote:

> On Tue, Mar 27, 2012 at 5:09 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Franc,
>>
>> With the given info, all we can tell is that it is possible, but we
>> can't tell how, as we have no idea how your data/dimensions/etc. are
>> structured. Being a little more specific would help.
>>
>
> Thanks, I'll go into more detail.
>
> We have data for a large number of entities (tens of millions) spanning
> 15+ years, with fairly fine-grained timestamps (though we could work with
> just day granularity).
>
> At the extremes, some queries will need a small number of entities across
> all 15 years, while others will need most of the entities for a small time
> range.
>
> Our current architecture (which we are reviewing) stores the data in 'day
> files', sorted so that the data we want is likely to be close together. We
> can then seek inside the files and only retrieve/process the parts we
> need.
>
> I'd like to avoid having Hadoop read and process every file in full to
> answer queries that don't need all the data.
>
> Is that clearer?
>


I should also add that we know the entities and time range we are
interested in at query submission time.
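
To make that a bit more concrete, here is a rough sketch of what I have in
mind on the job-submission side. It assumes, purely for illustration, that
the day files sit under date-named directories such as /data/2012-03-27;
the path layout and class name are hypothetical, not our real scheme:

// Sketch only: restrict a job's input to the day files that fall inside
// the queried date range, so days outside the range are never read.
import java.time.LocalDate;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class DayRangeInputs {

    /** Add only the day directories that fall inside [start, end]. */
    public static void addDayRange(Job job, LocalDate start, LocalDate end)
            throws Exception {
        Configuration conf = job.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            Path dayDir = new Path("/data/" + d);  // e.g. /data/2012-03-27
            if (fs.exists(dayDir)) {               // skip days with no data
                FileInputFormat.addInputPath(job, dayDir);
            }
        }
    }
}

The entity filter would still have to happen in the mapper (or in a custom
record reader), but at least whole days outside the range would never be
read at all.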


>
>
>> It is possible to select and pass the right set of inputs per job, and
>> also to implement record readers that read only what is specifically
>> needed. This all depends on how your files are structured.
>>
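
Just to make the record-reader idea concrete for anyone reading along, here
is a minimal sketch of a reader that wraps LineRecordReader and hands the
mapper only the lines for the entities named in the query. It assumes
tab-separated lines with the entity id in the first field and a
hypothetical "query.entities" configuration key set at submission time;
note that it filters records rather than seeking, so the split is still
scanned in full. Exploiting the sort order to skip bytes would need a more
format-aware reader:

// Sketch only: emit only records whose first tab-separated field is one of
// the entities requested for this job.
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class EntityFilteringRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();
    private Set<String> wanted;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);
        // Entities of interest, passed in via the job configuration.
        wanted = new HashSet<String>(Arrays.asList(
            context.getConfiguration().getStrings("query.entities", new String[0])));
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        while (delegate.nextKeyValue()) {
            String entity = delegate.getCurrentValue().toString().split("\t", 2)[0];
            if (wanted.contains(entity)) {
                return true;   // hand this record to the mapper
            }                  // otherwise skip it right here
        }
        return false;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}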
>> Taking a wild guess, Apache Hive with its columnar storage format
>> (RCFile) may also be what you are looking for.
>>
>
> Thanks, I'll have a look into that.
>
> cheers
>
>
>>
>> On Tue, Mar 27, 2012 at 11:32 AM, Franc Carter
>> <franc.car...@sirca.org.au> wrote:
>> > Hi,
>> >
>> > I'm very new to Hadoop and am working through how we may be able to
>> > apply it to our data set.
>> >
>> > One of the things that I am struggling with is understanding whether it
>> > is possible to tell Hadoop that only parts of the input files will be
>> > needed for a specific job. The reason I believe I may need this is that
>> > we have two big dimensions in our data set. Queries may want only one
>> > of these dimensions, and while some unneeded reading is unavoidable,
>> > there are cases where reading the entire data set presents a very
>> > significant overhead.
>> >
>> > Or have I just misunderstood something ;-(
>> >
>> > thanks
>> >
>>
>> --
>> Harsh J
>>
>
>


-- 

*Franc Carter* | Systems architect | Sirca Ltd

franc.car...@sirca.org.au | www.sirca.org.au

Tel: +61 2 9236 9118

Level 9, 80 Clarence St, Sydney NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215
