Thanks Greg, it looks like you get fine-grained control, even to the
point of providing the partitioning scheme yourself, as is the case with RADOS.

I will take your advice and experiment without any clustering
optimizations (at least not at the application level) for my use case,
and dig deeper if I need to.

It's nice to know that you can have this kind of control when you need it.

(Your answer would make a good post for the wiki; it's very clear and concise.)

thanks.

On Wed, Apr 27, 2011 at 5:02 PM, Gregory Farnum
<[email protected]> wrote:
> On Wednesday, April 27, 2011 at 12:42 PM, Fabio Kaminski wrote:
>> Ok, that's the way it should be.. :)
>>
>> But to make the question a bit more specific: what is the data
>> partitioning scheme between nodes, and how can a user control it?
>> At the block level? At the file level?
>> Suppose I have aggregations that I want to keep together on nodes,
>> even if replicated across several nodes, but always together, to
>> reduce the number of network round trips.
>
> The POSIX-compliant Ceph filesystem is built on top of the RADOS object 
> store. RADOS places objects into "placement groups" (PGs), and puts these PGs 
> on OSD storage nodes using a pseudo-random hash algorithm called CRUSH. 
> Generally this is based on the object's name (in Ceph the names are based on 
> inode numbers and which block of the file it is), although if you are using 
> the RADOS libraries to place data you can specify an alternative string to 
> hash on (look at the object_locator_t bits). This is there so that you can 
> place different pieces of data on the same node, but if you use this 
> interface you'll need to be able to provide the object_locator_t every time 
> you access the object.
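>
> A minimal sketch of that with the librados C API (untested, error
> checking omitted; the pool name "data" and the locator key
> "my-aggregate" are just placeholders):
>
>   #include <rados/librados.h>
>
>   int main(void)
>   {
>       rados_t cluster;
>       rados_ioctx_t io;
>
>       rados_create(&cluster, NULL);         /* connect as client.admin */
>       rados_conf_read_file(cluster, NULL);  /* default ceph.conf locations */
>       rados_connect(cluster);
>       rados_ioctx_create(cluster, "data", &io);  /* open the target pool */
>
>       /* Every object written while this locator key is set gets
>        * hashed on "my-aggregate" rather than on its own name, so
>        * they all land in the same PG (and thus on the same OSDs). */
>       rados_ioctx_locator_set_key(io, "my-aggregate");
>       rados_write_full(io, "object-a", "hello", 5);
>       rados_write_full(io, "object-b", "world", 5);
>
>       /* Reads have to set the same locator key, or the objects
>        * won't be found under their names alone. */
>
>       rados_ioctx_destroy(io);
>       rados_shutdown(cluster);
>       return 0;
>   }
>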
> When using Ceph you don't have access to the object_locator_t method of 
> placing data, but you do have some options. Using either the libraries or the 
> cephfs tool you can specify the preferred PG and/or the pool for a file to be 
> placed in. (You can also use the cephfs tool on a directory, so that its 
> placement settings will apply to that directory's subtree).
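> With the cephfs tool that looks roughly like this (untested; flag
> names from memory, so check cephfs --help, and older versions may
> also require the stripe parameters explicitly; the mount point and
> pool id are placeholders):
>
>   # send everything created under this directory to pool 3
>   cephfs /mnt/ceph/logs set_layout -p 3
>   cephfs /mnt/ceph/logs show_layout
>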
> Setting the PG will keep data together, and each OSD node has some PGs which 
> are kept local to that OSD whenever it's up if you want local reads/writes 
> (generally these PGs are unused, but they can be valuable for doing things 
> like simulating HDFS behavior). Setting just the pool will let the system 
> properly distribute data, but you can set up the CRUSH map so that the pool 
> always roots at a specific node, or do any of a number of other things to 
> specify how you want the data to be laid out.
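> To give a concrete picture of the CRUSH side, a rule that roots a
> pool at a single host would look something like this in the
> decompiled map (a sketch; "node1", the rule name, and the ruleset
> number are placeholders, and you still have to point the pool at the
> ruleset afterwards, e.g. with "ceph osd pool set <pool>
> crush_ruleset 3"; check the exact command for your version):
>
>   rule stick_to_node1 {
>           ruleset 3
>           type replicated
>           min_size 1
>           max_size 10
>           step take node1                  # a host bucket from the map
>           step choose firstn 0 type osd    # pick the replicas within it
>           step emit
>   }
>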
> Which interface you want to use for keeping data together depends a lot on 
> your exact use case and your skills.
>
> Lastly, I will say this: it is often the case that trying to keep specific 
> data co-located is not actually worth the trouble. You may have such a 
> use-case and it's always good to have tools that support such things, but eg 
> the Hadoop people have discovered that guaranteeing local writes is not 
> actually helpful to performance in most situations. Before going through the 
> hassle of setting up such a thing I'd be very sure that it matters to you and 
> that the default placement is unacceptable.
> -Greg