[GitHub] [arrow] raceordie690 commented on a diff in pull request #13120: ARROW-16530: [Go] Added concurrency in key places that are always serial, regardless if parallel=true or not

GitBox Sat, 14 May 2022 09:45:57 -0700


raceordie690 commented on code in PR #13120:
URL: https://github.com/apache/arrow/pull/13120#discussion_r873051130



##########
go/parquet/pqarrow/column_readers.go:
##########
@@ -216,12 +217,34 @@ func (sr *structReader) GetRepLevels() ([]int16, error) {
 }
 
 func (sr *structReader) LoadBatch(nrecords int64) error {
-       for _, rdr := range sr.children {
-               if err := rdr.LoadBatch(nrecords); err != nil {
-                       return err
+       var (
+               // REP -- Load batches in parallel
+               // When reading structs with large numbers of columns, the serial load is very slow.
+               // This is especially true when reading Cloud Storage. Loading concurrently
+               // greatly improves performance.
+               wg      sync.WaitGroup
+               errchan chan error = make(chan error)
+               err     error
+       )
+
+       //* Read First error from errchan and break only capturing first error
+       go func() {
+               for err = range errchan {
+                       break
                }
+       }()
+       wg.Add(len(sr.children))
+       for _, rdr := range sr.children {
+               go func(r *ColumnReader) {
+                       defer wg.Done()
+                       if err := r.LoadBatch(nrecords); err != nil {
+                               errchan <- err
+                       }
+               }(rdr)
        }
-       return nil
+       wg.Wait() // wait for reads to complete
+       close(errchan)

Review Comment:
   I looked at creating a helper, but there are complications. The areas that 
would use it differ enough in functionality that a shared helper would make 
things more complicated, whereas the pattern itself is pretty straightforward. 
ReadRowGroups seems overly complicated for what it needs to accomplish; 
however, I'm reluctant to change something that works.
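
   For illustration only, the "load every child concurrently and keep the 
first error" pattern discussed here could be factored out roughly as sketched 
below. This is a minimal sketch under assumptions, not code from the PR or from 
pqarrow: `batchLoader`, `fakeLoader`, `errBoom`, and `loadAllConcurrently` are 
hypothetical names introduced for the example. It records the first error with 
`sync.Once` rather than an unbuffered channel, so no goroutine blocks on a send 
when more than one child fails.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// batchLoader is a hypothetical stand-in for the child readers in the diff.
// It is an assumption for this sketch, not an interface defined by pqarrow.
type batchLoader interface {
	LoadBatch(nrecords int64) error
}

// fakeLoader is a hypothetical test double that returns a fixed error.
type fakeLoader struct{ err error }

func (f fakeLoader) LoadBatch(_ int64) error { return f.err }

// errBoom is a hypothetical sentinel error used by the demo below.
var errBoom = errors.New("boom")

// loadAllConcurrently (hypothetical helper) calls LoadBatch on every loader
// in its own goroutine, waits for all of them to finish, and returns the
// first error that was recorded, or nil if every load succeeded.
func loadAllConcurrently(loaders []batchLoader, nrecords int64) error {
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	wg.Add(len(loaders))
	for _, l := range loaders {
		go func(l batchLoader) {
			defer wg.Done()
			if err := l.LoadBatch(nrecords); err != nil {
				// Only the first failing goroutine stores its error;
				// later failures are dropped without blocking.
				once.Do(func() { firstErr = err })
			}
		}(l)
	}
	// Every write to firstErr happens before the matching Done, so reading
	// it after Wait returns is race-free.
	wg.Wait()
	return firstErr
}

func main() {
	loaders := []batchLoader{fakeLoader{}, fakeLoader{err: errBoom}, fakeLoader{}}
	if err := loadAllConcurrently(loaders, 1024); err != nil {
		fmt.Println("load failed:", err)
	}
}
```

   Whether a helper like this fits the other call sites is a separate 
question; the sketch only covers the case where every child exposes the same 
`LoadBatch`-style call and the caller wants a single error back.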



