Re: [PATCH] RFC: Cache LTO streamer mappings

2011-10-12 Thread Diego Novillo
On Sun, Oct 9, 2011 at 13:11, Andi Kleen <a...@firstfloor.org> wrote:

> Is it still a good idea?

Given that you found no speedups and it introduces added complexity, I
think it's best if we revisit the idea later.  I never found bytecode
reading to be a bottleneck in LTO, but perhaps Jan can comment on what
the experience is with Mozilla builds.


Diego.


Re: [PATCH] RFC: Cache LTO streamer mappings

2011-10-12 Thread Diego Novillo

On 11-10-12 08:25, Jan Hubicka wrote:


> WPA is currently about 1/3 of reading & type merging, 1/3 of streaming out
> and 1/3 of inlining.  Inlining is relatively easy to cure, so yes, streaming
> performance is important.  The very basic streaming primitives actually
> still show up near the top of the profile, along with the hashing and
> type-comparison code.  I will post some updated oprofiles into the Mozilla PR.


OK, thanks.  My numbers are from very early LTO development.


> Honestly, I think we won't get any great speedups unless we work on
> reducing the amount of unnecessary info we pickle/unpickle.


That's what I was leaning towards.  Optimizing the basic access patterns 
may not buy us as much as just reducing the amount of clutter we have to 
deal with.  It may make sense, however, as a subsequent optimization.



Diego.


Re: [PATCH] RFC: Cache LTO streamer mappings

2011-10-12 Thread Jan Hubicka
> On 11-10-12 08:25, Jan Hubicka wrote:
>
>> WPA is currently about 1/3 of reading & type merging, 1/3 of streaming out
>> and 1/3 of inlining.  Inlining is relatively easy to cure, so yes, streaming
>> performance is important.  The very basic streaming primitives actually
>> still show up near the top of the profile, along with the hashing and
>> type-comparison code.  I will post some updated oprofiles into the Mozilla PR.
>
> OK, thanks.  My numbers are from very early LTO development.

Yeah, the problem is minor on small projects and C projects. C++ tends to carry
a lot of context with it - both in the files streamed from compilation to WPA
(a lot of types and such) as well as into individual ltrans units.

We still need to stream in and out about 2GB from WPA to ltrans (combined sizes
of ltrans0 to ltrans31) and since we are at 3 minutes of compilation now,
seconds actually count.


>> Honestly, I think we won't get any great speedups unless we work on
>> reducing the amount of unnecessary info we pickle/unpickle.
>
> That's what I was leaning towards.  Optimizing the basic access patterns
> may not buy us as much as just reducing the amount of clutter we have to
> deal with.  It may make sense, however, as a subsequent optimization.

I will give this patch a try on Mozilla to see if I can report some positive
numbers.  Obviously, keeping the basic I/O efficient is also important.

Honza


> Diego.


[PATCH] RFC: Cache LTO streamer mappings

2011-10-09 Thread Andi Kleen
From: Andi Kleen <a...@linux.intel.com>

Currently the LTO file mappings use a simple one-off cache.
This doesn't match the access pattern very well.

This patch adds a real LRU of LTO mappings with a total limit.
Each file is completely mapped now instead of only specific
sections. This addresses one of the FIXME comments in LTO.
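
(For reference, whole-file mapping boils down to a single mmap of the
object file.  Below is a minimal standalone sketch of that step, not the
patch's code; the helper name map_whole_file is hypothetical and error
handling is trimmed.)

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Map all of NAME read-only; store the length in *SIZE.
   Returns NULL on failure.  */
static char *
map_whole_file (const char *name, size_t *size)
{
  struct stat st;
  char *p;
  int fd = open (name, O_RDONLY);
  if (fd < 0)
    return NULL;
  if (fstat (fd, &st) < 0)
    {
      close (fd);
      return NULL;
    }
  *size = st.st_size;
  p = (char *) mmap (NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
  close (fd);  /* The mapping survives the close.  */
  return p == MAP_FAILED ? NULL : p;
}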

The limit is 256GB on 64-bit hosts and 256MB on 32-bit hosts.  The limit
can be temporarily exceeded by a single file.  The whole
file now has to fit into the address space, which may increase
the address space requirements a bit.
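
(As a compact model of that policy -- not the patch itself -- the
standalone sketch below uses malloc in place of mmap and hypothetical
names enforce_limit/cache_get.  Eviction walks the LRU from the tail and
skips entries still in use, which is why a single large file can push
the total past the cap until something becomes evictable.)

#include <stdio.h>
#include <stdlib.h>

#define MAX_MAPPED (256UL << 20)  /* 256MB, the 32-bit cap.  */

struct entry
{
  struct entry *next, *prev;  /* LRU links; head = most recent.  */
  char *map;                  /* Simulated mapping.  */
  size_t size;
  int refcnt;                 /* Pinned entries are never evicted.  */
};

static struct entry lru = { &lru, &lru, NULL, 0, 0 };
static size_t cur_mapped;

/* Account for INCREMENT new bytes, evicting unreferenced entries from
   the LRU tail until the total fits under MAX (or nothing is left).  */
static void
enforce_limit (size_t increment, size_t max)
{
  size_t new_mapped = cur_mapped + increment;
  struct entry *e, *prev;
  for (e = lru.prev; new_mapped > max && e != &lru; e = prev)
    {
      prev = e->prev;
      if (e->refcnt > 0)
        continue;
      new_mapped -= e->size;
      e->prev->next = e->next;
      e->next->prev = e->prev;
      free (e->map);  /* The real patch munmaps here.  */
      free (e);
    }
  cur_mapped = new_mapped;
}

/* "Map" SIZE bytes, pin the entry, and put it at the LRU head.  */
static struct entry *
cache_get (size_t size)
{
  struct entry *e;
  enforce_limit (size, MAX_MAPPED);
  e = (struct entry *) calloc (1, sizeof *e);
  e->map = (char *) malloc (size);
  e->size = size;
  e->refcnt = 1;
  e->next = lru.next;
  e->prev = &lru;
  lru.next->prev = e;
  lru.next = e;
  return e;
}

int
main (void)
{
  struct entry *a = cache_get (200UL << 20);
  a->refcnt--;              /* Unpin; now evictable.  */
  cache_get (100UL << 20);  /* 300MB > cap: evicts the first file.  */
  printf ("mapped now: %zu MB\n", cur_mapped >> 20);  /* 100 MB.  */
  return 0;
}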

I originally wrote this in an attempt to minimize fragmentation
of the virtual memory map, but it didn't make much difference
there because the fragmentation was all caused by GGC.

Also, on my fairly large builds it didn't make a measurable
compile-time difference, probably because it was shadowed
by other, much slower passes.  That is why I'm just sending it as an
RFC.  It certainly complicates the code somewhat.

Maybe people with other LTO builds could try it and see whether it
makes a difference for them.

Is it still a good idea?

Passes a full LTO bootstrap plus test suite on x86_64-linux.

gcc/lto/:

2011-10-05  Andi Kleen  <a...@linux.intel.com>

* lto.c (list_head, mapped_file): Add.
(page_mask): Rename to page_size.
(MB, GB, max_mapped, cur_mapped, mapped_lru, list_add, list_del): Add.
(mf_lru_enforce_limit, mf_hashtable, mf_lru_finish_cache, mf_eq,
mf_hash, mf_lookup_or_create): Add.
(lto_read_section_data): Split into two ifdef versions.
Implement version using LRU cache. Add more error checks.
(mf_lru_finish_cache): Add dummy in ifdef.
(free_section_data): Rewrite for LRU.
(read_cgraph_and_symbols): Call mf_lru_finish_cache.
---
 gcc/lto/lto.c |  292 ++---
 1 files changed, 238 insertions(+), 54 deletions(-)

diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index a77eeb4..29dc3b8 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -1141,6 +1141,30 @@ lto_file_finalize (struct lto_file_decl_data *file_data, lto_file *file)
   lto_free_section_data (file_data, LTO_section_decls, NULL, data, len);
 }
 
+/* A list node or head.  */
+struct list_head
+{
+  struct list_head *next;    /* Next node.  */
+  struct list_head *prev;    /* Previous node.  */
+};
+
+/* Cache of mapped files.  */
+struct mapped_file
+{
+  struct list_head lru;      /* LRU list.  Must be first.  */
+  char *map;                 /* Mapping of the file.  */
+  size_t size;               /* Size of the mapping (rounded up).  */
+  int refcnt;                /* Number of users.  */
+  const char *filename;      /* File name.  */
+};
+
+struct lwstate
+{
+  lto_file *file;
+  struct lto_file_decl_data **file_data;
+  int *count;
+};
+
 /* Finalize FILE_DATA in FILE and increase COUNT. */
 
 static int 
@@ -1200,65 +1224,213 @@ lto_file_read (lto_file *file, FILE *resolution_file, int *count)
 #endif
 
 #if LTO_MMAP_IO
+
 /* Page size of machine is used for mmap and munmap calls.  */
-static size_t page_mask;
-#endif
+static size_t page_size;
+
+#define MB (1UL << 20)
+#define GB (1UL << 30)
+
+/* Limit of mapped files.  */
+static unsigned HOST_WIDE_INT max_mapped = sizeof (void *) > 4 ? 256*GB : 256*MB;
+
+/* Total size of currently mapped files */
+static unsigned HOST_WIDE_INT cur_mapped; 
+
+/* LRU of mapped files */
+static struct list_head mapped_lru = { &mapped_lru, &mapped_lru };
+
+/* Add NODE at the head of list HEAD.  */
+
+static void
+list_add (struct list_head *node, struct list_head *head)
+{
+  struct list_head *prev = head;
+  struct list_head *next = head->next;
+
+  next->prev = node;
+  node->next = next;
+  node->prev = prev;
+  prev->next = node;
+}
+
+/* Remove NODE from list.  */
+
+static void
+list_del (struct list_head *node)
+{
+  struct list_head *prev = node->prev;
+  struct list_head *next = node->next;
+
+  if (!next && !prev)
+    return;
+  next->prev = prev;
+  prev->next = next;
+  node->next = NULL;
+  node->prev = NULL;
+}
+
+/* Enforce the global LRU limit MAX when the commitment changes by
+   INCREMENT.  */
+
+static void
+mf_lru_enforce_limit (unsigned HOST_WIDE_INT increment,
+                      unsigned HOST_WIDE_INT max)
+{
+  struct mapped_file *mf;
+  unsigned HOST_WIDE_INT new_mapped = cur_mapped + increment;
+  struct list_head *node, *prev;
+
+  for (node = mapped_lru.prev; new_mapped > max && node != &mapped_lru;
+       node = prev)
+    {
+      prev = node->prev;
+      mf = (struct mapped_file *) node;
+      if (mf->refcnt > 0)
+        continue;
+      munmap (mf->map, mf->size);
+      mf->map = NULL;
+      new_mapped -= mf->size;
+      list_del (node);
+    }
+
+  cur_mapped = new_mapped;
+}
+
+/* Hash table of mapped_files */
+static htab_t mf_hashtable;
+
+/* Free all mappings in the hash table. */
+
+static void
+mf_lru_finish_cache (void)
+{
+  mf_lru_enforce_limit (0, 0);
+  gcc_assert (mapped_lru.next == mapped_lru.prev);
+  htab_delete (mf_hashtable);
+  mf_hashtable = NULL;
+}
+
+/* Compare hash table entries A and B. */
+
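
(The message is truncated here, before the bodies of mf_eq and mf_hash.
Purely as an illustrative guess -- the real bodies are not shown above --
libiberty-style htab callbacks keyed by the cached file's name would look
roughly like this:)

/* Sketch only: the patch text is cut off above, so these bodies are an
   assumption, not the patch's code.  They follow the libiberty htab
   callback shapes, keying the cache by filename.  */
static int
mf_eq (const void *a, const void *b)
{
  const struct mapped_file *mfa = (const struct mapped_file *) a;
  const struct mapped_file *mfb = (const struct mapped_file *) b;
  return strcmp (mfa->filename, mfb->filename) == 0;
}

static hashval_t
mf_hash (const void *p)
{
  return htab_hash_string (((const struct mapped_file *) p)->filename);
}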