[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-23 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|---                         |FIXED

--- Comment #42 from Thomas Koenig  ---
Resolved, closing.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-23 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #41 from Thomas Koenig  ---
Author: tkoenig
Date: Tue Jul 23 08:57:45 2019
New Revision: 273727

URL: https://gcc.gnu.org/viewcvs?rev=273727&root=gcc&view=rev
Log:
2019-07-23  Thomas König  

Backport from trunk
PR libfortran/91030
* gfortran.texi (GFORTRAN_FORMATTED_BUFFER_SIZE): Document.
(GFORTRAN_UNFORMATTED_BUFFER_SIZE): Likewise.

2019-07-23  Thomas König  

Backport from trunk
PR libfortran/91030
* io/unix.c (BUFFER_SIZE): Delete.
(BUFFER_FORMATTED_SIZE_DEFAULT): New variable.
(BUFFER_UNFORMATTED_SIZE_DEFAULT): New variable.
(unix_stream): Add buffer_size.
(buf_read): Use s->buffer_size instead of BUFFER_SIZE.
(buf_write): Likewise.
(buf_init): Add argument unformatted.  Handle block sizes
for unformatted vs. formatted, using defaults if provided.
(fd_to_stream): Add argument unformatted in call to buf_init.
* libgfortran.h (options_t): Add buffer_size_formatted and
buffer_size_unformatted.
* runtime/environ.c (variable_table): Add
GFORTRAN_UNFORMATTED_BUFFER_SIZE and
GFORTRAN_FORMATTED_BUFFER_SIZE.


Modified:
branches/gcc-9-branch/gcc/fortran/ChangeLog
branches/gcc-9-branch/gcc/fortran/gfortran.texi
branches/gcc-9-branch/libgfortran/ChangeLog
branches/gcc-9-branch/libgfortran/io/unix.c
branches/gcc-9-branch/libgfortran/libgfortran.h
branches/gcc-9-branch/libgfortran/runtime/environ.c
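
The ChangeLog above replaces the fixed BUFFER_SIZE with per-unit sizes that
the two new environment variables can override.  The following is a minimal
sketch of that kind of lookup in C, not the code committed to libgfortran.
The helper names parse_buffer_size and unit_buffer_size are hypothetical,
and both default values are assumptions made for illustration (8192 is the
old fixed size mentioned in this thread).

```c
#include <errno.h>
#include <stdlib.h>

/* Assumed defaults, for illustration only.  */
#define FORMATTED_DEFAULT   8192L
#define UNFORMATTED_DEFAULT 131072L

/* Hypothetical helper: parse TEXT as a positive decimal byte count.
   Returns FALLBACK when TEXT is NULL, empty, malformed or not positive.  */
long
parse_buffer_size (const char *text, long fallback)
{
  char *end;
  long value;

  if (text == NULL || *text == '\0')
    return fallback;

  errno = 0;
  value = strtol (text, &end, 10);
  if (errno != 0 || *end != '\0' || value <= 0)
    return fallback;
  return value;
}

/* Hypothetical helper: buffer size for a newly opened unit, chosen by
   whether the unit is unformatted or formatted.  */
long
unit_buffer_size (int unformatted)
{
  if (unformatted)
    return parse_buffer_size (getenv ("GFORTRAN_UNFORMATTED_BUFFER_SIZE"),
                              UNFORMATTED_DEFAULT);
  return parse_buffer_size (getenv ("GFORTRAN_FORMATTED_BUFFER_SIZE"),
                            FORMATTED_DEFAULT);
}
```

With this sketch, running a program as
GFORTRAN_UNFORMATTED_BUFFER_SIZE=65536 ./a.out would make
unit_buffer_size (1) return 65536, while a malformed value such as "64k"
falls back to the default.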

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-21 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #40 from Thomas Koenig  ---
Author: tkoenig
Date: Sun Jul 21 15:55:49 2019
New Revision: 273643

URL: https://gcc.gnu.org/viewcvs?rev=273643&root=gcc&view=rev
Log:
2019-07-21  Thomas König  

PR libfortran/91030
* gfortran.texi (GFORTRAN_FORMATTED_BUFFER_SIZE): Document
(GFORTRAN_UNFORMATTED_BUFFER_SIZE): Likewise.

2019-07-21  Thomas König  

PR libfortran/91030
* io/unix.c (BUFFER_SIZE): Delete.
(BUFFER_FORMATTED_SIZE_DEFAULT): New variable.
(BUFFER_UNFORMATTED_SIZE_DEFAULT): New variable.
(unix_stream): Add buffer_size.
(buf_read): Use s->buffer_size instead of BUFFER_SIZE.
(buf_write): Likewise.
(buf_init): Add argument unformatted.  Handle block sizes
for unformatted vs. formatted, using defaults if provided.
(fd_to_stream): Add argument unformatted in call to buf_init.
* libgfortran.h (options_t): Add buffer_size_formatted and
buffer_size_unformatted.
* runtime/environ.c (variable_table): Add
GFORTRAN_UNFORMATTED_BUFFER_SIZE and
GFORTRAN_FORMATTED_BUFFER_SIZE.


Modified:
trunk/gcc/fortran/ChangeLog
trunk/gcc/fortran/gfortran.texi
trunk/libgfortran/ChangeLog
trunk/libgfortran/io/unix.c
trunk/libgfortran/libgfortran.h
trunk/libgfortran/runtime/environ.c

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-08 Thread jb at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #39 from Janne Blomqvist  ---
Now, with the fixed benchmark from the previous comment, on a Lustre
(version 2.5) system I get:

Test using 25000 bytes
Block size of file system: 4096
bs =       1024, 53.27 MiB/s
bs =       2048, 73.99 MiB/s
bs =       4096, 222.41 MiB/s
bs =       8192, 351.38 MiB/s
bs =      16384, 483.86 MiB/s
bs =      32768, 583.76 MiB/s
bs =      65536, 677.11 MiB/s
bs =     131072, 748.60 MiB/s
bs =     262144, 700.69 MiB/s
bs =     524288, 811.76 MiB/s
bs =    1048576, 1032.99 MiB/s
bs =    2097152, 1034.03 MiB/s
bs =    4194304, 1063.74 MiB/s
bs =    8388608, 1030.15 MiB/s
bs =   16777216, 1084.82 MiB/s
bs =   33554432, 1067.05 MiB/s
bs =   67108864, 1063.79 MiB/s


On the same system, on a NFS filesystem connected with Infiniband I get:

Test using 25000 bytes
Block size of file system: 1048576
bs =       1024, 301.41 MiB/s
bs =       2048, 351.51 MiB/s
bs =       4096, 471.39 MiB/s
bs =       8192, 444.61 MiB/s
bs =      16384, 510.88 MiB/s
bs =      32768, 527.99 MiB/s
bs =      65536, 516.57 MiB/s
bs =     131072, 481.38 MiB/s
bs =     262144, 514.29 MiB/s
bs =     524288, 462.06 MiB/s
bs =    1048576, 528.30 MiB/s
bs =    2097152, 526.76 MiB/s
bs =    4194304, 501.09 MiB/s
bs =    8388608, 493.61 MiB/s
bs =   16777216, 550.24 MiB/s
bs =   33554432, 532.20 MiB/s
bs =   67108864, 532.82 MiB/s


So for Lustre, a buffer size bigger than the current 8 kB at least seems
justified.  While Lustre sees improvements all the way to a 1 MB buffer size,
such large buffers by default seem a bit excessive.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-08 Thread jb at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #38 from Janne Blomqvist  ---
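First, I think there's a bug in the benchmark in comment #20. It writes
blocksize * sizeof(double), but then advances only blocksize for each iteration
of the loop. Fixed version writing just bytes below:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/statvfs.h>

double walltime (void)
{
  struct timeval TV;
  double elapsed;
  gettimeofday(&TV, NULL);
  elapsed = (double) TV.tv_sec + 1.0e-6*((double) TV.tv_usec);
  return elapsed;
}

#define NAME "out.dat"
#define N 25000

int main()
{
  int fd;
  unsigned char *p, *w;
  long i, size, blocksize, left, to_write;
  int bits;
  double t1, t2;
  struct statvfs buf;

  printf ("Test using %ld bytes\n", (long) N);
  statvfs (".", &buf);
  printf ("Block size of file system: %ld\n", buf.f_bsize);

  p = malloc(N * sizeof (*p));
  for (i=0; i<N; i++)
    p[i] = i;

  /* The loop setup down to "while" was lost in the archived copy.  It is
     reconstructed here from the declarations above and from the
     "bs = ..." lines (2**10 to 2**26) printed in this thread.  */
  for (bits = 10; bits < 27; bits++)
    {
      blocksize = 1L << bits;
      printf ("bs = %10ld, ", blocksize);
      fd = open (NAME, O_WRONLY | O_CREAT | O_TRUNC, 0666);
      w = p;
      left = N;
      t1 = walltime ();
      while (left > 0)
        {
          if (left >= blocksize)
            to_write = blocksize;
          else
            to_write = left;

          /* Write only the bytes that remain, so the last block does not
             read past the end of p.  */
          write (fd, w, to_write);
          w += to_write;
          left -= to_write;
        }
      close (fd);
      t2 = walltime ();
      printf ("%.2f MiB/s\n", N / (t2-t1) / 1048576);
    }
  free (p);
  unlink (NAME);

  return 0;
}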

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-07 Thread jb at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #37 from Janne Blomqvist  ---
One thing we could do would be to switch to pread and pwrite instead of using
lseek. That would avoid a few syscalls when updating the record length marker.
Though I guess the issue with GPFS isn't directly related to the number of
syscalls?
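
To illustrate the pread/pwrite idea: the strace output elsewhere in this
thread shows an lseek() followed by a 4-byte write() each time a record
length marker is filled in.  A minimal sketch, assuming a 4-byte marker and
a hypothetical helper name patch_record_marker (this is not libgfortran
code), could combine the two calls like this:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store a 4-byte record length at byte offset
   MARKER_POS without moving the descriptor's file offset.  One pwrite()
   stands in for an lseek()/write() pair, and no further lseek() is needed
   to get back to the end of the record.  Returns 0 on success, -1 on
   failure.  */
int
patch_record_marker (int fd, off_t marker_pos, uint32_t record_len)
{
  ssize_t n = pwrite (fd, &record_len, sizeof record_len, marker_pos);
  return n == (ssize_t) sizeof record_len ? 0 : -1;
}
```

Because pwrite() leaves the file offset alone, the next sequential write()
continues after the record data as if the marker update had not happened.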

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-07 Thread jb at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Janne Blomqvist  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jb at gcc dot gnu.org

--- Comment #36 from Janne Blomqvist  ---
I have access to a system with Lustre, which is another parallel file system
popular in HPC. Unfortunately I don't have gcc trunk set up there, but I can
easily test the C benchmark; give me a day or two.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-04 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #35 from Jerry DeLisle  ---
(In reply to Thomas Koenig from comment #34)
> There is another point to consider.
> 
> I suppose not very many people use big-endian data formats
> these days. Little-endian dominates these days, and people
> who require that conversion on a regular basis (why does
> HPC need that, by the way?) are probably few and far between.
> 
> Another question is if people who do serious HPC work do
> a lot of stuff (without conversion) like
> 
>   write(10) x(1::2)
> 
> which would actually use the buffers, instead of
> 
>   write (10) x
> 
> where the whole buffering discussion does not apply.
> 
> Jerry, if you use strides in writing, without conversion,
> what result would you get for different block sizes?
> 

Disregard my previous data. If I run the tests manually outside of the script
you provided I get consistent results:

$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1024 ./a.out
   2.7986080646514893 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=4096 ./a.out
   2.5836510658264160 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=8192 ./a.out
   2.5744562149047852 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=16384 ./a.out
   2.4813480377197266 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=32768 ./a.out
   2.5214788913726807 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
   2.4661610126495361 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=131072 ./a.out
   2.4065649509429932 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=262144 ./a.out
   2.4941890239715576 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=524288 ./a.out
   2.3842790126800537 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1048576 ./a.out
   2.4531490802764893 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=2097152 ./a.out
   2.5236811637878418 

So there is a sweet spot around 131072 on this particular machine (a Ryzen
2500U), and I agree we should be able to go higher. The inconsistency I
reported earlier was bugging me enough to experiment, which is how I
discovered this.

Strides without conversion:

$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
   1.8322470188140869 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
   1.8337209224700928 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=131072 ./a.out
   1.8346250057220459 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=262144 ./a.out
   1.8497080802917480 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=524288 ./a.out
   1.8243398666381836 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1048576 ./a.out
   1.7886412143707275 
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=2097152 ./a.out
   1.8285851478576660

All things considered I would say go for the higher value and the users can set
the environment variable lower if they need to.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-04 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #34 from Thomas Koenig  ---
There is another point to consider.

I suppose not very many people use big-endian data formats
these days. Little-endian dominates these days, and people
who require that conversion on a regular basis (why does
HPC need that, by the way?) are probably few and far between.

Another question is if people who do serious HPC work do
a lot of stuff (without conversion) like

  write(10) x(1::2)

which would actually use the buffers, instead of

  write (10) x

where the whole buffering discussion does not apply.

Jerry, if you use strides in writing, without conversion,
what result would you get for different block sizes?

If that is reasonably fast, then I am now leaning towards
making the default buffer much larger for unformatted.
Formatted default can stay as it is (adjustable via
environment variable), making the buffers larger there
would just be a waste of memory because of the
large CPU load in converting floating point numbers
(unless somebody can show a reasonable benchmark
demonstrating otherwise).

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-04 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #33 from Jerry DeLisle  ---
Well, I am not opposed to it. What we do not want is to pessimize older,
smaller machines where it does matter a lot. However, if Thomas's strategy
above is adjusted from 32768 to 65536, then out of the box it will work for
your system, which is the very first one like this we have encountered (it
appears unique from my perspective).  We are simply trying to strike a balance
across a population for which we have only the microscopic sample shown in
this PR. We came up with 8192 before, also from a small sample.  I have
another machine here where it makes no difference either way, and another
where it does really well most of the time at 1024 (believe it or not).

Thomas's approach is an attempt at a heuristic. Your page size idea is
something I need to explore a bit here to see what this thing is doing. I
doubt the HPC users are the majority in number, but they are certainly highly
important. I know many users around here who use gfortran on their office
workstations for preliminary testing and development before they go to the big
iron to finalize.

With the above said, I think your specific needs at 65536 can be satisfied and
we do appreciate the data and testing from you. I do wonder if we need to make
"Optimizing I/O" a blatantly obvious topic right at the TOP of all our
documentation, on the web page as well as in the docs.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-04 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #32 from David Edelsohn  ---
If the performance measured by Jerry is hitting limits of the 4 x 32KiB L1
D-Cache of the Ryzen 2500U, then the system has bigger problems than FORTRAN
I/O buffer size.

What is the target audience / market for GNU FORTRAN?

FORTRAN primarily is used for numerically intensive computing and HPC.  This
issue was discovered through an experiment by an organization that performs huge
HPC simulations and inquired about the performance of GNU FORTRAN.  I suggest
that GNU FORTRAN implement defaults appropriate for HPC systems if it wants to
increase adoption in large-scale commercial environments.

If we can find some heuristics that allow GNU FORTRAN to distinguish between
consumer and commercial systems, that would be ideal.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-04 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #31 from David Edelsohn  ---
What is the PAGESIZE on the Ryzen system?  On the POWER systems, the PAGESIZE
is 64K.  Maybe the optimal buffer size (write size) allows the filesystem to
perform double-buffering at the PAGESIZE.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-04 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #30 from Thomas Koenig  ---

> Why are you opposed to the larger 65536 or 131072 as a default?

Please look at Jerry's numbers from comment #24.

They show a severe regression (for his system) for blocksizes > 32768.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-04 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #29 from David Edelsohn  ---
> For formatted files, choose the value that the user supplied
> via an environment variable. If the user supplied nothing, then
> 
> - query the recommended block size via calling fstat and evaluating
>   st_blksize.
> - If st_blksize is less than 8192, use 8192 (current behavior)
> - if st_blksize is more than 32768, use 32768
> - otherwise use st_blksize

I assume that you meant UNformatted files.

Why are you opposed to the larger 65536 or 131072 as a default? The benefit at
that level is reproducible, _even for filesystems with smaller block size_.

Why propose another default value that restricts GNU FORTRAN performance when
given the opportunity to fix this and make GNU FORTRAN performance look very
good "out of the box". Few people will bother to read the documentation to look
for environment variables or even realize that unformatted I/O performance is
the bottleneck.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-04 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #28 from Thomas Koenig  ---
(In reply to Jerry DeLisle from comment #27)
> (In reply to Thomas Koenig from comment #26)
> > Jerry, you are working on a Linux box, right?  What does
> > 
> > stat -f -c %b .
> > 
> > tell you?
> 
> 13429330

So we cannot really take the values from the file system for the
buffer size at face value, at least not for determining the buffer size
for this particular case.

Last question (grasping at straws here): Are the values from
comment #24 reproducible, do you always get this big jump for
block size 65536?

If they are indeed reproducible, I would suggest using an approach
slightly modified from the attached patch:

For formatted files, choose the value that the user supplied
via an environment variable, or 8192 otherwise. (Formatted is so
slow that we might as well save the memory).

For formatted files, choose the value that the user supplied
via an environment variable. If the user supplied nothing, then

- query the recommended block size via calling fstat and evaluating
  st_blksize.
- If st_blksize is less than 8192, use 8192 (current behavior)
- if st_blksize is more than 32768, use 32768
- otherwise use st_blksize

How does that sound?
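
The selection rule proposed above can be sketched as a small C function.
This is an illustration only, not the attached patch: the helper name
unformatted_buffer_size is hypothetical, the caller is assumed to have read
st_blksize from fstat() and the user value from the environment already,
and the rule is applied to unformatted files, which is how comment #29
reads the second paragraph.

```c
/* Limits taken from the proposal in this comment.  */
#define MIN_UNFORMATTED_BUFFER 8192L
#define MAX_UNFORMATTED_BUFFER 32768L

/* Hypothetical helper.  USER_SIZE is the value supplied through the
   environment, with a value <= 0 meaning "not set".  ST_BLKSIZE is the
   preferred block size that fstat() reported for the file.  */
long
unformatted_buffer_size (long user_size, long st_blksize)
{
  if (user_size > 0)
    return user_size;                 /* An explicit user choice wins.  */
  if (st_blksize < MIN_UNFORMATTED_BUFFER)
    return MIN_UNFORMATTED_BUFFER;    /* Never below the current 8192.  */
  if (st_blksize > MAX_UNFORMATTED_BUFFER)
    return MAX_UNFORMATTED_BUFFER;    /* Cap very large block sizes.  */
  return st_blksize;
}
```

For example, a file system reporting a block size of 4096 would get an 8192
byte buffer, and one reporting 16777216 would be capped at 32768 unless the
user asked for something else.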

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-03 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #27 from Jerry DeLisle  ---
(In reply to Thomas Koenig from comment #26)
> Jerry, you are working on a Linux box, right?  What does
> 
> stat -f -c %b .
> 
> tell you?

13429330

Ryzen 2500U with M.2 SSD
Fedora 30, Kernel 5.1.15-300.fc30.x86_64

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-03 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #26 from Thomas Koenig  ---
Jerry, you are working on a Linux box, right?  What does

stat -f -c %b .

tell you?

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-02 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #25 from Thomas Koenig  ---
(In reply to Jerry DeLisle from comment #24)
> On a different Ryzen machine:
> 
> $ ./run.sh 
> 1024   3.2604169845581055 
> 2048   2.7804551124572754 
> 4096   2.6416599750518799 
> 8192   2.5986809730529785 
> 16384   2.5525100231170654 
> 32768   2.5145640373229980 
> 65536   9.2993371486663818 
> 131072   9.0313489437103271

Oops.

That increase for 65536 might be an L1 cache effect.

Note: We are measuring only transfer speed to cache
here. Transfer to actual hard disks will be much
slower.  It is still relevant though, especially for
the usual cycle of repeatedly calculating and writing
data.  The OS can then sync the data to disk at its
leisure while the next calculation is running.

So, what would be a good strategy to select a block size?

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-01 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #24 from Jerry DeLisle  ---
On a different Ryzen machine:

$ ./run.sh 
1024   3.2604169845581055 
2048   2.7804551124572754 
4096   2.6416599750518799 
8192   2.5986809730529785 
16384   2.5525100231170654 
32768   2.5145640373229980 
65536   9.2993371486663818 
131072   9.0313489437103271

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-01 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #23 from Thomas Koenig  ---
Some numbers for the provisional patch, varying the
size of the buffers.

With the patch, the original benchmark (minus some output, only
the elapsed time is shown) and the script

for a in 1024 2048 4096 8192 16384 32768 65536 131072
do
rm -f out.dat
sync ; sync; sync
sleep 1
echo -n $a
GFORTRAN_BUFFER_SIZE_UNFORMATTED=$a ./a.out
done
rm -f out.dat
sync

I get on my home Ryzen box with ext4

1024   2.959884643555 
2048   2.4514980316162109 
4096   2.2090110778808594 
8192   1.9955158233642578 
16384   2.0065548419952393 
32768   1.9320869445800781 
65536   1.9494299888610840 
131072   1.8885779380798340  

On gcc135 (POWER9) I get

1024   6.2069039344787598 
2048   3.5782949924468994 
4096   2.2184860706329346 
8192   1.4914679527282715 
16384   1.1247980594635010 
32768  0.95092821121215820 
65536  0.85877490043640137 
131072  0.82407808303833008

and on gcc115 (aarch64):

1024   10.543070077896118 
2048   7.3426060676574707 
4096   5.7169480323791504 
8192   4.7394258975982666 
16384   4.2912349700927734 
32768   4.0224111080169678 
65536   3.8719530105590820 
131072   3.8628818988800049

so 64 k looks like a good choice, except for the Ryzen machine,
where 8k would be sufficient.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-07-01 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #22 from David Edelsohn  ---
The following are unofficial results on an unspecified system running GPFS. 
These should not be considered official anything and should not be referenced
for benchmarking.

Test using 2.50e+08 doubles
Block size of file system: 16777216
bs =       1024, 126.53 MiB/s
bs =       2048, 218.69 MiB/s
bs =       4096, 335.00 MiB/s
bs =       8192, 436.25 MiB/s
bs =      16384, 774.91 MiB/s
bs =      32768, 619.28 MiB/s
bs =      65536, 1018.89 MiB/s
bs =     131072, 659.44 MiB/s
bs =     262144, 629.90 MiB/s
bs =     524288, .63 MiB/s
bs =    1048576, 678.90 MiB/s
bs =    2097152, 1029.28 MiB/s
bs =    4194304, 668.27 MiB/s
bs =    8388608, 662.53 MiB/s
bs =   16777216, .37 MiB/s
bs =   33554432, 694.28 MiB/s
bs =   67108864, 1091.94 MiB/s

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-30 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #21 from Thomas Koenig  ---
Created attachment 46537
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46537&action=edit
Something to benchmark.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-29 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #20 from Thomas Koenig  ---
(In reply to David Edelsohn from comment #18)
> For GPFS, the striping unit is 16M.  The 8K buffer size chosen by GFortran
> is a huge performance sink. We have confirmed this with testing.

Could you share some benchmarks on this?  I'd really like it if the
gfortran maintainers could form their own judgment on this, based
on numbers.

Here's a benchmark program:

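#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/statvfs.h>

double walltime (void)
{
  struct timeval TV;
  double elapsed;
  gettimeofday(&TV, NULL);
  elapsed = (double) TV.tv_sec + 1.0e-6*((double) TV.tv_usec);
  return elapsed;
}

#define NAME "out.dat"
#define N 250000000

int main()
{
  int fd;
  double *p, *w;
  long i, size, blocksize, left, to_write;
  int bits;
  double t1, t2;
  struct statvfs buf;

  printf ("Test using %e doubles\n", N * 1.0);
  statvfs (".", &buf);
  printf ("Block size of file system: %ld\n", buf.f_bsize);

  p = malloc(N * sizeof (*p));
  for (i=0; i<N; i++)
    p[i] = i;

  /* The loop setup down to "while" was lost in the archived copy.  It is
     reconstructed here from the declarations above and from the
     "bs = ..." lines (2**10 to 2**26) in the output below.  */
  for (bits = 10; bits < 27; bits++)
    {
      blocksize = 1L << bits;
      printf ("bs = %10ld, ", blocksize);
      fd = open (NAME, O_WRONLY | O_CREAT | O_TRUNC, 0666);
      w = p;
      left = N;
      t1 = walltime ();
      while (left > 0)
        {
          if (left >= blocksize)
            to_write = blocksize;
          else
            to_write = left;

          write (fd, w, blocksize * sizeof (double));
          w += to_write;
          left -= to_write;
        }
      close (fd);
      t2 = walltime ();
      printf ("%.2f MiB/s\n", N / (t2-t1) / 1048576);
    }
  free (p);
  unlink (NAME);

  return 0;
}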

And here is some output on my home system (ext4):

Test using 2.50e+08 doubles
Block size of file system: 4096
bs =       1024, 175.81 MiB/s
bs =       2048, 244.40 MiB/s
bs =       4096, 247.27 MiB/s
bs =       8192, 227.46 MiB/s
bs =      16384, 195.55 MiB/s
bs =      32768, 223.14 MiB/s
bs =      65536, 168.95 MiB/s
bs =     131072, 240.70 MiB/s
bs =     262144, 260.39 MiB/s
bs =     524288, 265.38 MiB/s
bs =    1048576, 261.67 MiB/s
bs =    2097152, 259.94 MiB/s
bs =    4194304, 258.71 MiB/s
bs =    8388608, 262.19 MiB/s
bs =   16777216, 260.19 MiB/s
bs =   33554432, 263.37 MiB/s
bs =   67108864, 264.47 MiB/s

And here is something on gcc135 (POWER9), also ext4:

Test using 2.50e+08 doubles
Block size of file system: 4096
bs =       1024, 206.76 MiB/s
bs =       2048, 293.66 MiB/s
bs =       4096, 347.13 MiB/s
bs =       8192, 298.23 MiB/s
bs =      16384, 397.51 MiB/s
bs =      32768, 401.86 MiB/s
bs =      65536, 431.83 MiB/s
bs =     131072, 475.88 MiB/s
bs =     262144, 470.09 MiB/s
bs =     524288, 478.84 MiB/s
bs =    1048576, 485.68 MiB/s
bs =    2097152, 485.33 MiB/s
bs =    4194304, 483.96 MiB/s
bs =    8388608, 482.88 MiB/s
bs =   16777216, 485.04 MiB/s
bs =   33554432, 483.92 MiB/s
bs =   67108864, 485.55 MiB/s

So, write throughput sort of seems to level out at ~ 131072 block size,
2**17.

For Fortran, this is only really relevant for unformatted files.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-29 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #19 from David Edelsohn  ---
IBM XLF provides an XLFRTEOPTS environment variable, which includes control
over buffer size.  The documentation makes it clear that XLF uses the block
size of the device by default:

buffer_size=size
Specifies the size of I/O buffers in bytes instead of using the block size of
devices. size must be either -1 or an integer value that is greater than or
equal to 4096. The default, -1, uses the block size of the device where the
file resides.
Using this option can reduce the amount of memory used for I/O buffers when an
application runs out of memory because the block size of devices is very large
and the application opens many files at the same time.

Note the following when using this runtime option:
Preconnected units remain unaffected by this option. Their buffer size is the
same as the block size of the device where they reside except when the block
size is larger than 64KB, in which case the buffer size is set to 64KB.
This runtime option does not apply to files on a tape device or logical volume.
Specifying the buffer size with the SETRTEOPTS procedure overrides any value
previously set by the XLFRTEOPTS environment variable or SETRTEOPTS procedure.
The resetting of this option does not affect units that have already been
opened.

https://www.ibm.com/support/knowledgecenter/SSAT4T_16.1.1/com.ibm.xlf1611.lelinux.doc/compiler_ref/rteopts.html

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-29 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #18 from David Edelsohn  ---
For GPFS, the striping unit is 16M.  The 8K buffer size chosen by GFortran is a
huge performance sink. We have confirmed this with testing.

The recommendation from GPFS is that one should query the filesystem with
fstat() and write in chunks of the block size.

Instead of arbitrarily choosing a uniform buffer size of 8K, GFortran would
achieve better I/O performance in general by dynamically querying the
filesystem characteristics and choosing a buffer size tuned to the filesystem.

Presumably one must find some balance of memory consumption if the application
opens a huge number of files.

Or maybe some environment variable to override the buffer size.

IBM XL FORTRAN achieves better performance, even for EXT4, by adapting I/O to
the filesystem block size.
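
Querying the file system as suggested above takes one fstat() call on the
open descriptor.  A minimal sketch, assuming a hypothetical helper name
preferred_io_size and a caller-supplied fallback (it does not show how
libgfortran or XL Fortran actually do this):

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: return the preferred I/O block size that the file
   system reports for the open descriptor FD, or FALLBACK when fstat()
   fails or reports a value that is not positive.  */
long
preferred_io_size (int fd, long fallback)
{
  struct stat st;

  if (fstat (fd, &st) != 0 || st.st_blksize <= 0)
    return fallback;
  return (long) st.st_blksize;
}
```

A runtime would still want to bound the result, since a very large reported
block size multiplied by many open files raises the memory concern mentioned
in this comment.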

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-29 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #17 from Thomas Koenig  ---
(In reply to David Edelsohn from comment #16)
> libgfortran unix.c:raw_write() will access the WRITE system call with up to
> 2GB of data, which the testcase is using for the native format.
> 
> Should libgfortran I/O buffer at least use sysconf(_SC_PAGESIZE) instead of
> hard coding 8192?

Depends.  I would try not to blow away too much cache for
such an operation.

So far, this problem appears to be limited to POWER, and more
specifically to file systems which are typically used in HPC.

Could you (generic you, people who have access to such systems)
show us some benchmarks which show performance as a function of
block write size?

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #16 from David Edelsohn  ---
libgfortran unix.c:raw_write() will access the WRITE system call with up to 2GB
of data, which the testcase is using for the native format.

Should libgfortran I/O buffer at least use sysconf(_SC_PAGESIZE) instead of
hard coding 8192?

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #15 from Thomas Koenig  ---
(In reply to David Edelsohn from comment #13)
> Why should -fconvert affect the strategy for writing?

If we get passed a contiguous block of memory (like in
your test case) we can do this in a single write.

If we want to swap bytes, this needs to be done on a basis
of each data item. It would be wasteful to write out each
data item by itself, so do this by copying to a buffer until it
is full, and then writing out the buffer.

The effect on speed could be tested simply enough. Just write
two test programs, one of them mimicking the current behavior of
libgfortran (writing out 8192 byte blocks, possibly starting with
a smaller size) and the other one with one huge write.  Use the
write() system call directly. Benchmark both.
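
The buffering strategy described above (swap each item into a buffer, write
the buffer when it is full) can be sketched in C as follows.  This is an
illustration under stated assumptions, not the libgfortran implementation:
the helper names write_all and write_swapped_doubles are hypothetical, the
items are assumed to be doubles, and the buffer is fixed at the 8192 bytes
discussed in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SWAP_BUFFER_SIZE 8192   /* fixed size discussed in this thread */

/* Write all LEN bytes of BUF, retrying after short writes.  */
static int
write_all (int fd, const unsigned char *buf, size_t len)
{
  while (len > 0)
    {
      ssize_t n = write (fd, buf, len);
      if (n <= 0)
        return -1;
      buf += n;
      len -= (size_t) n;
    }
  return 0;
}

/* Hypothetical helper: write COUNT doubles from SRC to FD with the byte
   order of each item reversed.  Items are swapped into a fixed buffer,
   which is flushed with write() whenever the next item would not fit.
   Returns 0 on success, -1 on failure.  */
int
write_swapped_doubles (int fd, const double *src, size_t count)
{
  unsigned char buffer[SWAP_BUFFER_SIZE];
  size_t used = 0;

  for (size_t i = 0; i < count; i++)
    {
      unsigned char item[sizeof (double)];

      if (used + sizeof item > sizeof buffer)
        {
          if (write_all (fd, buffer, used) != 0)
            return -1;
          used = 0;
        }

      memcpy (item, &src[i], sizeof item);
      for (size_t b = 0; b < sizeof item; b++)
        buffer[used + b] = item[sizeof item - 1 - b];
      used += sizeof item;
    }

  return used > 0 ? write_all (fd, buffer, used) : 0;
}
```

Timing this against a single write() of the unswapped array would be one way
to run the two-program comparison suggested above.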

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #14 from Jerry DeLisle  ---
(In reply to David Edelsohn from comment #13)
> Why should -fconvert affect the strategy for writing?

Hi David, very interesting bug report and a good question. I would like to
investigate further if I know what platform this is on.

Since GPFS is a parallel file system that is potentially going across a
network, the results could depend on the test environment. Also, there are a
few cases where code paths in libgfortran can depend on OS features.

So in your results here:

> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
> 
> Is libgfortran doing something unusual with the creation of files?

Are all your results here under identical OS and are the physical drives local
to the test machine hardware? If we can reproduce this on a gcc compile farm
machine or maybe at OSU Open Software lab which I can access, we ought to be
able to do better.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #13 from David Edelsohn  ---
Why should -fconvert affect the strategy for writing?

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #12 from Andrew Pinski  ---
(In reply to David Edelsohn from comment #10)
> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
> 
> Is libgfortran doing something unusual with the creation of files?

So it looks like native is just doing one write system call while opposite
endian is doing 8k chunks write system calls.  This seems like an issue with
the file system if it cannot handle 8k chunks.

Maybe increasing the chunk size to 64k inside libgfortran will help.

Something like:
diff --git a/libgfortran/io/unix.c b/libgfortran/io/unix.c
index c2fc674..5d24ac4 100644
--- a/libgfortran/io/unix.c
+++ b/libgfortran/io/unix.c
@@ -193,7 +193,7 @@ fallback_access (const char *path, int mode)

 /* Unix and internal stream I/O module */

-static const int BUFFER_SIZE = 8192;
+static const int BUFFER_SIZE = 64*1024;

 typedef struct
 {

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #11 from Thomas Koenig  ---
(In reply to David Edelsohn from comment #10)
> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
> 
> Is libgfortran doing something unusual with the creation of files?

Not really, but there is one difference.

Stracing with -fconvert=native gives

open("out.dat", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=28, ...}) = 0
write(3, "\0\0\0\0", 4) = 4
write(3,
"\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?"...,
20) = 20
lseek(3, 0, SEEK_SET)   = 0
write(3, "\0\2245w", 4) = 4
lseek(3, 24, SEEK_SET)  = 24
write(3, "\0\2245w", 4) = 4
ftruncate(3, 28)        = 0
close(3)                = 0

and using -fconvert=swap

open("out.dat", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=28, ...}) = 0
write(3,
"\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0"...,
7684) = 7684
write(3,
"?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"...,
8192) = 8192
write(3,
"?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"...,
8192) = 8192

[...]

write(3,
"?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"...,
8192) = 8192
write(3,
"?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"...,
8192) = 8192
write(3,
"?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"...,
5632) = 5632
lseek(3, 0, SEEK_SET)   = 0
write(3, "w5\224\0", 4) = 4
lseek(3, 24, SEEK_SET)  = 24
write(3, "w5\224\0", 4) = 4
ftruncate(3, 28)        = 0
close(3)                = 0

Would this make such a large difference?
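
As a rough check on how many write calls the swapped case issues, the
following Python sketch decodes the record marker shown in both traces and
counts the buffered writes. It assumes that the elided "[...]" part of the
trace consists only of full 8192-byte writes; the function names are
hypothetical.

```python
import struct

# Record markers exactly as they appear in the two traces.
NATIVE_MARKER = b"\x00\x94\x35\x77"   # "\0\2245w", little-endian
SWAPPED_MARKER = b"\x77\x35\x94\x00"  # "w5\224\0", big-endian


def record_length(marker, byteorder):
    """Decode a 4-byte record marker; byteorder is '<' or '>'."""
    return struct.unpack(byteorder + "I", marker)[0]


def count_buffered_writes(total, first, chunk, last):
    """Return (calls, leftover) for a stream of `total` bytes.

    Assumes one write of `first` bytes, then only full `chunk`-sized
    writes, then one write of `last` bytes.  A leftover of 0 means the
    sizes are consistent with that assumption.
    """
    full, leftover = divmod(total - first - last, chunk)
    return 1 + full + 1, leftover


payload = record_length(NATIVE_MARKER, "<")
assert payload == record_length(SWAPPED_MARKER, ">")

# The data itself is the double 1.0 in either byte order.
assert struct.unpack("<d", b"\x00\x00\x00\x00\x00\x00\xf0\x3f")[0] == 1.0
assert struct.unpack(">d", b"\x3f\xf0\x00\x00\x00\x00\x00\x00")[0] == 1.0

# The first buffered write starts with the 4-byte placeholder marker,
# so the buffered stream is that marker plus the payload.
calls, leftover = count_buffered_writes(4 + payload, 7684, 8192, 5632)

print(payload)          # 2000000000
print(calls, leftover)  # 244141 0
```

Under that assumption the swapped case issues about 244,000 write calls for
the data, against a single one in the native case. The sizes printed as 20,
24 and 28 in the traces do not match this total; they would be consistent
with 2000000000, 2000000004 and 2000000008 if zeros were lost somewhere, but
that is only an inference.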

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #10 from David Edelsohn  ---
With EXT4: difference is 2x
With SHM: difference is 4.5x
With GPFS: difference is 10x

Is libgfortran doing something unusual with the creation of files?

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #9 from Thomas Koenig  ---
On powerpc64le-unknown-linux-gnu:

 write time(sec) =   0.48150300979614258
 done

real    0m0.889s
user    0m0.279s
sys     0m0.608s

vs.

 write time(sec) =    1.4788339138031006
 done

real    0m1.880s
user    0m0.669s
sys     0m1.208s

Less good, but not as bad as you're reporting.

On aarch64-unknown-linux-gnu:

 write time(sec) =    3.3060228824615479
 done

real    0m4.739s
user    0m0.300s
sys     0m4.420s

vs.

 write time(sec) =    4.7578129768371582
 done

real    0m6.091s
user    0m1.080s
sys     0m5.000s

The factor is also bearable.
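
For reference, the factors implied by the write times above can be computed
with a short Python sketch. It assumes that the first figure in each pair is
the native run and the second the converted one, which the comment does not
state explicitly.

```python
# Write times in seconds, copied from the output above.
# Assumption: (native, converted) order within each pair.
timings = {
    "powerpc64le-unknown-linux-gnu": (0.48150300979614258, 1.4788339138031006),
    "aarch64-unknown-linux-gnu": (3.3060228824615479, 4.7578129768371582),
}


def slowdown(native, converted):
    """Ratio of converted write time to native write time."""
    return converted / native


ratios = {name: slowdown(*pair) for name, pair in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

This gives roughly 3.07x on powerpc64le and 1.44x on aarch64, well below the
10x reported for GPFS.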

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #8 from Andrew Pinski  ---
(In reply to Thomas Koenig from comment #7)
> Also, which version of gfortran did you use?
> 
> If it was before r195413, I can very well believe those
> numbers.

Note that revision made it into GCC 4.8.0.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |WAITING

--- Comment #7 from Thomas Koenig  ---
Also, which version of gfortran did you use?

If it was before r195413, I can very well believe those
numbers.

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #6 from Thomas Koenig  ---
I cannot reproduce this on an AMD Ryzen 7 1700X (little-endian):

$ gfortran -fconvert=native wr.f90 walltime.c
cc1: warning: command-line option '-fconvert=native' is valid for Fortran but
not for C
$ rm -f out.dat ; time ./a.out ; rm -f out.dat
 write time(sec) =    1.0676949024200439
 done

real    0m1.399s
user    0m0.112s
sys     0m1.083s
$ gfortran -fconvert=big-endian wr.f90 walltime.c
cc1: warning: command-line option '-fconvert=big-endian' is valid for Fortran
but not for C
$ rm -f out.dat ; time ./a.out ; rm -f out.dat
 write time(sec) =    1.4781639575958252
 done

real    0m1.773s
user    0m0.397s
sys     0m1.196s

which looks reasonable.

Platform specific?  Which OS/processor combination did you test this on?

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #5 from David Edelsohn  ---
XL Fortran with -qufmt=be : 0.75 sec
XL Fortran native : 0.30 sec

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #4 from Thomas Koenig  ---
(In reply to David Edelsohn from comment #3)
> Conversion carries an overhead, but the overhead need not be worse than
> necessary.  The conversion overhead for libgfortran is significantly worse
> than for competing, proprietary compilers.
> 
> -fconvert=big-endian relative to no conversion
> Compiler         Slowdown
> -------------------------
> GFortran            1000%
> IBM XLF              200%
> Intel Fortran         20%

Do you also have absolute numbers?

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

David Edelsohn  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|enhancement                 |normal

--- Comment #3 from David Edelsohn  ---
Conversion carries an overhead, but the overhead need not be worse than
necessary.  The conversion overhead for libgfortran is significantly worse than
for competing, proprietary compilers.

-fconvert=big-endian relative to no conversion
Compiler         Slowdown
-------------------------
GFortran            1000%
IBM XLF              200%
Intel Fortran         20%

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tkoenig at gcc dot gnu.org

--- Comment #2 from Thomas Koenig  ---
https://gcc.gnu.org/onlinedocs/gfortran/CONVERT-specifier.html

"Using anything but the native representation for unformatted data carries a
significant speed overhead. If speed in this area matters to you, it is best if
you use this only for data that needs to be portable."

[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian

2019-06-28 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

David Edelsohn  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-06-28
     Ever confirmed|0                           |1

--- Comment #1 from David Edelsohn  ---
Confirmed.