OK, I ran the following tests:
[1]
The application spawns 8 threads; I write to a Lustre file striped across 8 OSTs.
Each thread writes data in blocks of 1 MB in a round-robin fashion, i.e.
T0 writes to offsets 0, 8MB, 16MB, etc.
T1 writes to offsets 1MB, 9MB, 17MB, etc.
With the stripe size being 1 MB, every thread ends up writing to only 1 OST.
I see a bandwidth of 280 MB/sec, similar to the single-thread performance.
[2]
I also ran the same test such that every thread writes data in blocks of
8 MB for the same stripe size. (Thus, every thread will write to every
OST.) I still get similar performance, ~280 MB/sec, so essentially I see
no difference between each thread writing to a single OST vs. each thread
writing to all OSTs.
And as I said before, if all threads write to their own separate files,
the resulting bandwidth is ~700 MB/sec.
I have attached my C file (simple_io_test.c) herewith. Maybe you could run
it and see where the bottleneck is. Comments and instructions for
compilation have been included in the file. Do let me know if you need any
clarification on that.
Your help is appreciated,
Kshitij
> This is what my application does:
>
> Each thread has its own file descriptor to the file.
> I use pwrite to ensure non-overlapping regions, as follows:
>
> Thread 0, data_size: 1MB, offset: 0
> Thread 1, data_size: 1MB, offset: 1MB
> Thread 2, data_size: 1MB, offset: 2MB
> Thread 3, data_size: 1MB, offset: 3MB
>
> <repeat cycle>
> Thread 0, data_size: 1MB, offset: 4MB
> and so on. (This happens in parallel; I don't wait for one cycle to end
> before the next one begins.)
>
> I am going to try the following:
> a)
> Instead of a round-robin distribution of offsets, test with sequential
> offsets:
> Thread 0, data_size: 1MB, offset:0
> Thread 0, data_size: 1MB, offset:1MB
> Thread 0, data_size: 1MB, offset:2MB
> Thread 0, data_size: 1MB, offset:3MB
>
> Thread 1, data_size: 1MB, offset:4MB
> and so on. (I am going to keep these as separate pwrite I/O requests
> instead of merging them or using writev.)
>
> b)
> Map the threads to the number of OSTs using some modulo, as suggested in
> the email below.
>
> c)
> Experiment with a smaller number of OSTs (I currently have 48).
>
> I shall report back with my findings.
>
> Thanks,
> Kshitij
>
>> [Moved to Lustre-discuss]
>>
>>
>> "However, if I spawn 8 threads such that all of them write to the same
>> file (non-overlapping locations), without explicitly synchronizing the
>> writes (i.e. I don't lock the file handle)"
>>
>>
>> How exactly does your multi-threaded application write the data? Are
>> you using pwrite to ensure non-overlapping regions, or are they all just
>> doing unlocked write() operations on the same fd (each thread just
>> transferring size/8)? If it divides the file into N pieces, and each
>> thread does pwrite on its piece, then what each OST sees are multiple
>> streams at wide offsets to the same object, which could impact
>> performance.
>>
>> If on the other hand the file is written sequentially, where each thread
>> grabs the next piece to be written (locking normally used for the
>> current_offset value, so you know where each chunk is actually going),
>> then you get a more sequential pattern at the OST.
>>
>> If the number of threads maps to the number of OSTs (or some modulo,
>> like in your case 6 OSTs per thread), and each thread "owns" the piece
>> of the file that belongs to an OST (i.e. for (offset = thread_num * 6MB;
>> offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then
>> you've eliminated the need for application locks (assuming the use of
>> pwrite) and ensured each OST object is being written sequentially.
>>
>> It's quite possible there is some bottleneck on the shared fd. So
>> perhaps the question is not why you aren't scaling with more threads,
>> but why the single file is not able to saturate the client, or why the
>> file BW is not scaling with more OSTs. It is somewhat common for
>> multiple processes (on different nodes) to write non-overlapping regions
>> of the same file; does performance improve if each thread opens its own
>> file descriptor?
>>
>> Kevin
>>
>>
>> Wojciech Turek wrote:
>>> OK, so it looks like you have 64 OSTs in total and your output file is
>>> striped across 48 of them. May I suggest that you limit the number of
>>> stripes; a good number to start with would be 8. Also, for best
>>> results, use the OST pools feature to arrange that each stripe goes to
>>> an OST owned by a different OSS.
>>>
>>> regards,
>>>
>>> Wojciech
>>>
>>> On 23 May 2011 23:09, <[email protected]> wrote:
>>>
>>> Actually, 'lfs check servers' returns 64 entries as well, so I presume
>>> the system documentation is out of date.
>>>
>>> Again, I am sorry the basic information had been incorrect.
>>>
>>> - Kshitij
>>>
>>> > Run lfs getstripe <your_output_file> and paste the output of that
>>> > command to the mailing list.
>>> > Stripe count of 48 is not possible if you have max 11 OSTs (the max
>>> > stripe count will be 11).
>>> > If your striping is correct, the bottleneck can be your client
>>> > network.
>>> >
>>> > regards,
>>> >
>>> > Wojciech
>>> >
>>> >
>>> >
>>> > On 23 May 2011 22:35, <[email protected]> wrote:
>>> >
>>> >> The stripe count is 48.
>>> >>
>>> >> Just FYI, this is what my application does:
>>> >> A simple I/O test where threads continually write blocks of size
>>> >> 64 KB or 1 MB (decided at compile time) till a large file of, say,
>>> >> 16 GB is created.
>>> >>
>>> >> Thanks,
>>> >> Kshitij
>>> >>
>>> >> > What is your stripe count on the file? If your default is 1, you
>>> >> > are only writing to one of the OSTs. You can check with the lfs
>>> >> > getstripe command; you can set the stripe count bigger, and
>>> >> > hopefully your wide-striped file with threaded writes will be
>>> >> > faster.
>>> >> >
>>> >> > Evan
>>> >> >
>>> >> > -----Original Message-----
>>> >> > From: [email protected]
>>> >> > [mailto:[email protected]] On Behalf Of
>>> >> > [email protected]
>>> >> > Sent: Monday, May 23, 2011 2:28 PM
>>> >> > To: [email protected]
>>> >> > Subject: [Lustre-community] Poor multithreaded I/O performance
>>> >> >
>>> >> > Hello,
>>> >> > I am running a multithreaded application that writes to a common
>>> >> > shared file on a Lustre fs, and this is what I see:
>>> >> >
>>> >> > If I have a single thread in my application, I get a bandwidth of
>>> >> > approx. 250 MB/sec (11 OSTs, 1 MB stripe size). However, if I
>>> >> > spawn 8 threads such that all of them write to the same file
>>> >> > (non-overlapping locations), without explicitly synchronizing the
>>> >> > writes (i.e. I don't lock the file handle), I still get the same
>>> >> > bandwidth.
>>> >> >
>>> >> > Now, instead of writing to a shared file, if these threads write
>>> >> > to separate files, the bandwidth obtained is approx. 700 MB/sec.
>>> >> >
>>> >> > I would ideally like my multithreaded application to see similar
>>> >> > scaling. Any ideas why the performance is limited, and any
>>> >> > workarounds?
>>> >> >
>>> >> > Thank you,
>>> >> > Kshitij
>>> >> >
>>> >> >
>>> >> > _______________________________________________
>>> >> > Lustre-community mailing list
>>> >> > [email protected]
>>> <mailto:[email protected]>
>>> >> > http://lists.lustre.org/mailman/listinfo/lustre-community
>>> >> >
>>> >>
>>>
>>
>
>
/*
 * This is a simple I/O program to test the bandwidth of a filesystem.
 *
 * It creates a shared matrix (or a thread-private matrix) and writes it to
 * a shared file (or separate files), as selected by the compile-time
 * options -D_USE_THREAD_PRIVATE_DATA and -D_SEP_FILES. With thread-private
 * data, each thread creates its own matrix of size 1GB/num_threads and
 * writes it to the file. With separate files, each file will be of size
 * file_size and the bandwidth is reported accordingly.
 *
 * When writing to a shared file from a thread-private matrix, data is
 * written contiguously: all data from T0 precedes data from T1, and so on.
 * When writing to a shared file from a common global matrix, data is
 * written round-robin: T0 writes segment 0, T1 writes segment 1, and so on.
 *
 * Input arguments: file_size seg_size filename num_threads
 * Compile-time options: -D_DEBUG -D_USE_THREAD_PRIVATE_DATA -D_SEP_FILES
 */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>
#include <sys/time.h>
#ifdef _OPENMP
#include <omp.h>
#endif
long file_size;
long seg_size;
int seg_rows; //not read from cmd line.
char* filename;
int num_threads;
char** matrix, **filename_array;
int DEBUG = 0;
pthread_t* threads;
int *threadids;
struct timeval tv;
pthread_barrier_t barr;
void allocate_2D_matrix(char***, int, long);
void create_threads();
void *pthread_start_routine(void*);
void pthread_task(int);
void create_filename_array();
void read_cmd_args(int, char**);
void print_input_args();
void print_default_input_format();
double timestamp();
void cleanup();
void validate_data_dump(char** matrix, char* filename);
int main(int argc, char** argv)
{
read_cmd_args(argc, argv);
#ifdef _SEP_FILES
create_filename_array(); //Array to store all filenames
#endif
pthread_barrier_init(&barr, NULL, num_threads);
#ifndef _USE_THREAD_PRIVATE_DATA
allocate_2D_matrix(&matrix, seg_rows, seg_size); //Create a global matrix
#endif
create_threads();
pthread_task(0);
return 0;
}
/*
Create a 2-D matrix. It will have seg_rows, each of size seg_size
*/
void allocate_2D_matrix(char*** matrix, int seg_rows, long seg_size)
{
int i;
//create the matrix
char* data = (char *) calloc (1, seg_rows * seg_size * sizeof(char));
if(data == NULL)
{
printf("\nCould not allocate memory for matrix.\n");
exit(1);
}
*matrix = (char **) malloc (seg_rows * sizeof(char *));
if(*matrix == NULL)
{
printf("\nCould not allocate memory for matrix.\n");
exit(1);
}
#pragma omp parallel for
for(i=0; i<seg_rows; i++)
{
(*matrix)[i] = &(data[i*seg_size]);
}
//initialize the matrix
srand((unsigned int) timestamp());
#pragma omp parallel for
for(i = 0; i<seg_rows; i++)
{
int j;
for(j = 0; j<(seg_size/sizeof(char)); j++)
(*matrix)[i][j] = rand() % 100;
}
}
void create_threads()
{
int i;
threads = (pthread_t *) malloc ((num_threads-1) * sizeof(pthread_t));
threadids = (int *) malloc ((num_threads-1)*sizeof(int));
if(DEBUG) puts("Spawning threads now...");
for(i=0; i<num_threads-1; i++)
threadids[i] = i+1;
for(i=0; i<num_threads-1; i++)
{
int rc;
rc = pthread_create(&threads[i], NULL, pthread_start_routine, &threadids[i]);
if (rc)
{
printf("\nError creating threads.\n");
free(threadids);
exit(1);
}
}
if(DEBUG) puts("Threads spawned.");
}
void *pthread_start_routine(void *threadid)
{
int tid = (int)(*(int *)threadid);
if(DEBUG) printf("Thread %d initialized\n", tid);
pthread_task(tid);
pthread_exit(NULL);
}
void pthread_task(int tid)
{
double t1, t2;
int i, fd, index, iterations;
long offset, offset_gap;
char errnobuf[256];
char **my_matrix;
char* my_filename;
#ifdef _SEP_FILES
iterations = file_size/seg_size; //so that each file will be size file_size
my_filename = filename_array[tid];
#else
iterations = file_size / (seg_size * num_threads); //so that the resulting file should be of size file_size
my_filename = filename;
#endif
#ifdef _USE_THREAD_PRIVATE_DATA //decide the size of the matrix each thread shall create
if(file_size > 1073741824)
seg_rows = 1073741824 / (seg_size * num_threads);
else
seg_rows = file_size / (seg_size * num_threads);
allocate_2D_matrix(&my_matrix, seg_rows, seg_size);
#endif
//Open file
if(-1 == (fd = open(my_filename, O_RDWR|O_CREAT, 0644)))
{
strerror_r(errno, errnobuf, sizeof(errnobuf));
printf("\nError opening file %s, thread %d, errno %d, error: %s\n",
my_filename, tid, errno, errnobuf);
cleanup();
exit(-1);
}
pthread_barrier_wait(&barr);
if(tid == 0)
t1 = timestamp();
//Perform IO
#ifdef _USE_THREAD_PRIVATE_DATA //write to contiguous offsets in file (compare with round-robin manner of writing)
offset = tid * (file_size/num_threads);
offset_gap = seg_size;
index = 0;
#ifdef _SEP_FILES
iterations = file_size/seg_size;
offset = 0; //overwrite
#endif
for(i=0; i<iterations;i++)
{
if(pwrite(fd, my_matrix[index], seg_size, offset) != seg_size)
perror("pwrite");
offset += offset_gap;
index ++;
if(index > seg_rows-1)
index = 0;
}
#else //if writing from global matrix, write to file in a round-robin manner
index = tid;
offset = index * seg_size;
offset_gap = num_threads * seg_size;
#ifdef _SEP_FILES
iterations = file_size/seg_size;
offset = 0;
offset_gap = seg_size;
#endif
for(i=0; i<iterations;i++)
{
if(pwrite(fd, matrix[index], seg_size, offset) != seg_size)
perror("pwrite");
offset += offset_gap;
index += num_threads;
if(index > seg_rows-1)
index = tid;
}
#endif
fsync(fd);
pthread_barrier_wait(&barr);
if(tid == 0)
t2 = timestamp();
//Close file
close(fd);
#ifdef _USE_THREAD_PRIVATE_DATA
free(my_matrix);
#endif
if(tid == 0)
{
float bw;
#ifdef _SEP_FILES
bw = (file_size*num_threads)/((t2-t1)*1024*1024);
#else
bw = file_size/((t2-t1)*1024*1024);
#endif
if(seg_size >= 1048576)
printf("seg_size: %ld MB, BW: %f MBps\n", seg_size/(1024*1024), bw);
else
printf("seg_size: %ld KB, BW: %f MBps\n", seg_size/1024, bw);
if(DEBUG) //open the resulting file and verify the contents
{
#ifndef _USE_THREAD_PRIVATE_DATA
validate_data_dump(matrix, filename); //not checked if this works. Might segfault.
#endif
}
cleanup();
}
}
/*
Create an array of filenames when threads need to write to separate files
*/
void create_filename_array()
{
int i;
size_t len = strlen(filename) + 16; //room for a numeric suffix
filename_array = (char **) malloc (num_threads * sizeof(char *));
for (i=0; i<num_threads; i++)
{
filename_array[i] = (char *) malloc (len);
snprintf(filename_array[i], len, "%s%d", filename, i);
}
}
void read_cmd_args(int argc, char** argv)
{
if(argc < 5)
{
print_default_input_format();
exit(-1);
}
file_size = atol(argv[1]);
seg_size = atol(argv[2]);
filename = argv[3];
num_threads = atoi(argv[4]);
if(file_size > 1073741824)
seg_rows = 1073741824/seg_size;
else
seg_rows = file_size/seg_size;
#ifdef _DEBUG
DEBUG = 1;
#endif
}
void print_input_args()
{
printf("Input args: file_size: %ld, seg_size: %ld, filename: %s, num_threads: %d\n",
file_size, seg_size, filename, num_threads);
}
void print_default_input_format()
{
printf("Input arguments expected in the form:\n"
"file_size, seg_size, filename, num_threads, (optional:) -D_DEBUG -D_USE_THREAD_PRIVATE_DATA "
"-D_SEP_FILES\n");
}
double timestamp()
{
double t;
gettimeofday(&tv, NULL);
t = tv.tv_sec + (tv.tv_usec/1000000.0);
return t;
}
void cleanup()
{
#ifndef _USE_THREAD_PRIVATE_DATA
free(matrix);
#endif
free(threads);
free(threadids);
pthread_barrier_destroy(&barr);
}
void validate_data_dump(char** matrix, char* filename)
{
int i, index, fd, valid = 1;
long offset = 0;
index = 0;
//heap-allocate the read buffer; seg_size may be too large for the stack
char *temp = (char *) malloc (seg_size);
if(temp == NULL)
{
printf("\nCould not allocate memory for validation buffer.\n");
return;
}
if(DEBUG) printf("Validating contents of output file...\n");
fd = open(filename, O_RDONLY);
for(i = 0; i<file_size/seg_size; i++)
{
if(pread(fd, temp, seg_size, offset) != seg_size)
{
valid = 0;
printf("Short read at offset %ld.\nExiting validator.\n", offset);
break;
}
//memcmp, not strncmp: the data is binary and may contain NUL bytes
if(memcmp(matrix[index], temp, seg_size) != 0)
{
valid = 0;
printf("Invalid data in output file at matrix pos %d, iteration %d.\nExiting validator.\n", index, i);
break;
}
index++;
offset += seg_size;
if(index >= seg_rows) index = 0;
}
if(valid)
printf("Output file contents verified\n");
close(fd);
free(temp);
}
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss