Hi Andrew - sounds great. Thank you for the workaround details.

I'll run some performance testing based on the flags you mentioned and look
into the Kudu code further to understand how the internal rowsets work when
it comes to handling large amounts of data. Once I have the test results, or
any thoughts on how to support larger image/document files, I'll definitely
share them back.
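For reference, this is roughly the unsupported flag combination from your
note that I'd be testing, written as a tablet server flagfile sketch (the
1 MB value is just an example I picked for testing, not a recommendation):

```
# tserver.gflagfile -- experimental, unsupported settings for testing only.
# --unlock_unsafe_flags permits overriding flags Kudu marks as unsafe;
# --max_cell_size_bytes raises the per-cell limit above the 64KB default.
--unlock_unsafe_flags
--max_cell_size_bytes=1048576
```

As you said, anything above 64KB is untested territory, so I'll treat any
results as exploratory.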

Thanks,

William.

On Fri, Sep 8, 2017 at 11:45 PM, Andrew Wong <[email protected]> wrote:

> Hi William,
>
> Glad to hear you're enjoying Kudu so far! Always nice to hear from excited
> Kudu-explorers.
>
> You're right about the current limitations of Kudu when it comes to binary
> storage. Due to these, the use-cases you mentioned for large files in Kudu
> are currently fairly uncommon. There are unsupported workarounds to this
> limit (e.g. --unlock_unsafe_flags --max_cell_size_bytes=<more than 64KB>),
> but values above 64KB are, as the name implies, unsafe, untested, and not
> openly supported. If you'd like to experiment with these, you're more than
> welcome to (and report back with what you find!).
>
> There may be others more qualified to discuss storing larger data in Kudu,
> but my understanding of it is that Kudu stores groups of rows together in a
> columnar format (rowsets), and rolls these based on size. If one (or more)
> of these columns is particularly large, the resulting rowset might be a
> single row, and you might hit performance walls, etc. There may be more
> issues I'm not aware of. Overall it's just very different from what Kudu
> handles well at the moment (not to deter you, I am interested in seeing
> what you find if you do pursue this).
>
>
> Andrew
>
> On Fri, Sep 8, 2017 at 1:37 PM, William Li <[email protected]>
> wrote:
>
> > Hi All Kudu developers,
> >
> >
> >
> > I have been using Kudu in projects. It's been amazing. A few projects
> > have recently posted requirements on how to use Kudu to store large
> > binary files (images, documents, etc.). We used to propose Kudu + HDFS
> > (or another file system) as a workaround, but it is not really a good
> > solution. The main scenarios of need are:
> >
> > 1). Use Kudu as the only storage layer. As we store larger amounts of
> > data and grow the Kudu cluster, it should support both structured and
> > unstructured data, to avoid managing another storage tier for images or
> > documents.
> >
> > 2). It'd be great to simplify the architecture from the business
> > application's point of view: a single data access layer (either at the
> > Impala/Spark SQL level or at the Kudu API level) to manage a business
> > data object or entity and its related images/documents.
> >
> >
> >
> > We are thinking about ways to extend Kudu to support large files: either
> > through the current BINARY data type, which has a size limitation (64KB)
> > due to known issues; or by introducing a new data type like BLOB for
> > storing images or documents that range from a few hundred KBs to a few
> > MBs; or by extending the Kudu API to store the files in a file system
> > (which might be more suitable for even larger files). Many relational and
> > NoSQL databases offer different levels of support, or different designs,
> > e.g. HBase, Cassandra, MapR-DB.
> >
> >
> >
> > I'd like to ask for your feedback or opinions:
> >
> > 1). Do you have a need to store larger content (like images or documents)
> > in Kudu (at the MB level)?
> >
> > 2). Do you have any opinions on storing such large content inside the
> > database versus in a file system?
> >
> >
> >
> > Your comments are much appreciated. Thanks!
> >
>
>
>
> --
> Andrew Wong
>
