[jira] [Work logged] (CRUNCH-698) Avro DataFileReader creation can hang

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-698?focusedWorklogId=546772&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546772
 ]

ASF GitHub Bot logged work on CRUNCH-698:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:37
Start Date: 03/Feb/21 01:37
Worklog Time Spent: 10m 
  Work Description: jwills merged pull request #34:
URL: https://github.com/apache/crunch/pull/34


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546772)
Time Spent: 1h  (was: 50m)

> Avro DataFileReader creation can hang
> -
>
> Key: CRUNCH-698
> URL: https://issues.apache.org/jira/browse/CRUNCH-698
> Project: Crunch
>  Issue Type: Bug
>  Components: Core, IO
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
> Fix For: 1.1.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> A severe Avro bug [AVRO-2944|https://issues.apache.org/jira/browse/AVRO-2944] 
> was recently found in the static method for creating a DataFileReader 
> instance, where it can get stuck in an infinite loop while trying to read the 
> 4-byte "magic" header of the file.
> The stack trace looks like this:
> {noformat}
> "main" #1 prio=5 os_prio=0 tid=0x7f8798027000 nid=0x7d9c runnable 
> [0x7f87a0924000]
>java.lang.Thread.State: RUNNABLE
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at org.apache.avro.mapred.FsInput.read(FsInput.java:54)
>   at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:55)
>   at 
> org.apache.crunch.types.avro.AvroRecordReader.initialize(AvroRecordReader.java:58)
>   at 
> org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:152)
>   at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:571)
>   at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>   at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:802)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
> {noformat}
> This was fixed in Avro 1.10.1 but has not yet been backported to any other 
> Avro versions. The issue has existed since Avro 1.5, although we have only 
> encountered it recently. It does not happen under normal circumstances; some 
> very unusual input stream behavior (a partial/throttled read, or an unexpected 
> EOF) is required to trigger it. We've only seen it with the S3AFileSystem's 
> S3AInputStream, suddenly starting a few days ago for no apparent reason. Even 
> now it is sporadic, happening a small percentage of the time in job tasks that 
> read many S3 files, but often enough to be problematic. An AWS support case is 
> open to attempt to find out what could have caused this.
> To avoid depending on a particular Avro version for the fix, we can probably 
> just patch this locally in Crunch, since it's only one static method, and 
> apart from one legacy constant everything we need access to in the Avro code 
> is public.
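> For illustration, a minimal sketch of the failure mode and a defensive 
> variant (assumed shapes only, not the actual Avro source): a header-fill 
> loop that ignores the -1 EOF return of {{InputStream.read}} can never 
> terminate once the stream ends early.
> {code:java}
> import java.io.EOFException;
> import java.io.IOException;
> import java.io.InputStream;
>
> public final class MagicHeaderSketch {
>   // Hypothetical sketch of the bug, not the actual Avro code: at EOF read()
>   // returns -1, so 'remaining' grows instead of shrinking and the loop spins
>   // forever -- matching the RUNNABLE thread stuck in DataInputStream.read().
>   static void readMagicBuggy(InputStream in, byte[] magic) throws IOException {
>     int offset = 0;
>     int remaining = magic.length; // the 4-byte "magic" header
>     while (remaining > 0) {
>       int n = in.read(magic, offset, remaining);
>       offset += n;
>       remaining -= n;
>     }
>   }
>
>   // Defensive variant that fails fast on a truncated or misbehaving stream.
>   static void readMagicSafe(InputStream in, byte[] magic) throws IOException {
>     int offset = 0;
>     while (offset < magic.length) {
>       int n = in.read(magic, offset, magic.length - offset);
>       if (n < 0) {
>         throw new EOFException("Stream ended before the 4-byte magic header was read");
>       }
>       offset += n;
>     }
>   }
> }
> {code}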



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-698) Avro DataFileReader creation can hang

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-698?focusedWorklogId=546751&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546751
 ]

ASF GitHub Bot logged work on CRUNCH-698:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:35
Start Date: 03/Feb/21 01:35
Worklog Time Spent: 10m 
  Work Description: noslowerdna opened a new pull request #34:
URL: https://github.com/apache/crunch/pull/34


   Fixes [AVRO-2944](https://issues.apache.org/jira/browse/AVRO-2944) where 
Avro's static method for creating a DataFileReader instance can get stuck in an 
infinite loop while trying to read the 4-byte "magic" header of the file. More 
details can be found at 
[CRUNCH-698](https://issues.apache.org/jira/browse/CRUNCH-698).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546751)
Time Spent: 50m  (was: 40m)

> Avro DataFileReader creation can hang
> -
>
> Key: CRUNCH-698
> URL: https://issues.apache.org/jira/browse/CRUNCH-698
> Project: Crunch
>  Issue Type: Bug
>  Components: Core, IO
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
> Fix For: 1.1.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A severe Avro bug [AVRO-2944|https://issues.apache.org/jira/browse/AVRO-2944] 
> was recently found in the static method for creating a DataFileReader 
> instance, where it can get stuck in an infinite loop while trying to read the 
> 4-byte "magic" header of the file.
> The stack trace looks like this:
> {noformat}
> "main" #1 prio=5 os_prio=0 tid=0x7f8798027000 nid=0x7d9c runnable 
> [0x7f87a0924000]
>java.lang.Thread.State: RUNNABLE
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at org.apache.avro.mapred.FsInput.read(FsInput.java:54)
>   at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:55)
>   at 
> org.apache.crunch.types.avro.AvroRecordReader.initialize(AvroRecordReader.java:58)
>   at 
> org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:152)
>   at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:571)
>   at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>   at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:802)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
> {noformat}
> This was fixed in Avro 1.10.1 but has not yet been backported to any other 
> Avro versions. The issue has existed since Avro 1.5, although we have only 
> encountered it recently. It does not happen under normal circumstances; some 
> very unusual input stream behavior (a partial/throttled read, or an unexpected 
> EOF) is required to trigger it. We've only seen it with the S3AFileSystem's 
> S3AInputStream, suddenly starting a few days ago for no apparent reason. Even 
> now it is sporadic, happening a small percentage of the time in job tasks that 
> read many S3 files, but often enough to be problematic. An AWS support case is 
> open to attempt to find out what could have caused this.
> To avoid depending on a particular Avro version for the fix, we can probably 
> just patch this locally in Crunch, since it's only one static method, and 
> apart from one legacy constant everything we need access to in the Avro code 
> is public.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-698) Avro DataFileReader creation can hang

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-698?focusedWorklogId=546608&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546608
 ]

ASF GitHub Bot logged work on CRUNCH-698:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:21
Start Date: 03/Feb/21 01:21
Worklog Time Spent: 10m 
  Work Description: jwills commented on pull request #34:
URL: https://github.com/apache/crunch/pull/34#issuecomment-771810234


   LGTM-- thank you!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546608)
Time Spent: 40m  (was: 0.5h)

> Avro DataFileReader creation can hang
> -
>
> Key: CRUNCH-698
> URL: https://issues.apache.org/jira/browse/CRUNCH-698
> Project: Crunch
>  Issue Type: Bug
>  Components: Core, IO
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
> Fix For: 1.1.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> A severe Avro bug [AVRO-2944|https://issues.apache.org/jira/browse/AVRO-2944] 
> was recently found in the static method for creating a DataFileReader 
> instance, where it can get stuck in an infinite loop while trying to read the 
> 4-byte "magic" header of the file.
> The stack trace looks like this:
> {noformat}
> "main" #1 prio=5 os_prio=0 tid=0x7f8798027000 nid=0x7d9c runnable 
> [0x7f87a0924000]
>java.lang.Thread.State: RUNNABLE
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at org.apache.avro.mapred.FsInput.read(FsInput.java:54)
>   at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:55)
>   at 
> org.apache.crunch.types.avro.AvroRecordReader.initialize(AvroRecordReader.java:58)
>   at 
> org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:152)
>   at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:571)
>   at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>   at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:802)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
> {noformat}
> This was fixed in Avro 1.10.1 but has not yet been backported to any other 
> Avro versions. The issue has existed since Avro 1.5, although we have only 
> encountered it recently. It does not happen under normal circumstances; some 
> very unusual input stream behavior (a partial/throttled read, or an unexpected 
> EOF) is required to trigger it. We've only seen it with the S3AFileSystem's 
> S3AInputStream, suddenly starting a few days ago for no apparent reason. Even 
> now it is sporadic, happening a small percentage of the time in job tasks that 
> read many S3 files, but often enough to be problematic. An AWS support case is 
> open to attempt to find out what could have caused this.
> To avoid depending on a particular Avro version for the fix, we can probably 
> just patch this locally in Crunch, since it's only one static method, and 
> apart from one legacy constant everything we need access to in the Avro code 
> is public.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-698) Avro DataFileReader creation can hang

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-698?focusedWorklogId=546136&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546136
 ]

ASF GitHub Bot logged work on CRUNCH-698:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 17:20
Start Date: 02/Feb/21 17:20
Worklog Time Spent: 10m 
  Work Description: jwills merged pull request #34:
URL: https://github.com/apache/crunch/pull/34


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546136)
Time Spent: 0.5h  (was: 20m)

> Avro DataFileReader creation can hang
> -
>
> Key: CRUNCH-698
> URL: https://issues.apache.org/jira/browse/CRUNCH-698
> Project: Crunch
>  Issue Type: Bug
>  Components: Core, IO
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A severe Avro bug [AVRO-2944|https://issues.apache.org/jira/browse/AVRO-2944] 
> was recently found in the static method for creating a DataFileReader 
> instance, where it can get stuck in an infinite loop while trying to read the 
> 4-byte "magic" header of the file.
> This was fixed in Avro 1.10.1 but has not yet been backported to any other 
> Avro versions. The issue has existed since Avro 1.5, although we have only 
> encountered it recently. It does not happen under normal circumstances; some 
> very unusual input stream behavior (a partial/throttled read, or an unexpected 
> EOF) is required to trigger it. We've only seen it with the S3AFileSystem's 
> S3AInputStream, suddenly starting a few days ago for no apparent reason. Even 
> now it is sporadic, happening a small percentage of the time in job tasks that 
> read many S3 files, but often enough to be problematic. An AWS support case is 
> open to attempt to find out what could have caused this.
> To avoid depending on a particular Avro version for the fix, we can probably 
> just patch this locally in Crunch, since it's only one static method, and 
> apart from one legacy constant everything we need access to in the Avro code 
> is public.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-698) Avro DataFileReader creation can hang

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-698?focusedWorklogId=546120&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546120
 ]

ASF GitHub Bot logged work on CRUNCH-698:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 16:42
Start Date: 02/Feb/21 16:42
Worklog Time Spent: 10m 
  Work Description: noslowerdna opened a new pull request #34:
URL: https://github.com/apache/crunch/pull/34


   Fixes [AVRO-2944](https://issues.apache.org/jira/browse/AVRO-2944) where 
Avro's static method for creating a DataFileReader instance can get stuck in an 
infinite loop while trying to read the 4-byte "magic" header of the file. More 
details can be found at 
[CRUNCH-698](https://issues.apache.org/jira/browse/CRUNCH-698).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546120)
Remaining Estimate: 0h
Time Spent: 10m

> Avro DataFileReader creation can hang
> -
>
> Key: CRUNCH-698
> URL: https://issues.apache.org/jira/browse/CRUNCH-698
> Project: Crunch
>  Issue Type: Bug
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A severe Avro bug [AVRO-2944|https://issues.apache.org/jira/browse/AVRO-2944] 
> was recently found in the static method for creating a DataFileReader 
> instance, where it can get stuck in an infinite loop while trying to read the 
> 4-byte "magic" header of the file.
> This was fixed in Avro 1.10.1 but has not yet been backported to any other 
> Avro versions. The issue has existed since Avro 1.5, although we have only 
> encountered it recently. It does not happen under normal circumstances; some 
> very unusual input stream behavior (a partial/throttled read, or an unexpected 
> EOF) is required to trigger it. We've only seen it with the S3AFileSystem's 
> S3AInputStream, suddenly starting a few days ago for no apparent reason. Even 
> now it is sporadic, happening a small percentage of the time in job tasks that 
> read many S3 files, but often enough to be problematic. An AWS support case is 
> open to attempt to find out what could have caused this.
> To avoid depending on a particular Avro version for the fix, we can probably 
> just patch this locally in Crunch, since it's only one static method, and 
> apart from one legacy constant everything we need access to in the Avro code 
> is public.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-695) NullPointerException in RegionLocationTable

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-695?focusedWorklogId=409616&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-409616
 ]

ASF GitHub Bot logged work on CRUNCH-695:
-

Author: ASF GitHub Bot
Created on: 25/Mar/20 16:14
Start Date: 25/Mar/20 16:14
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #32: CRUNCH-695: 
Fix NullPointerException in RegionLocationTable
URL: https://github.com/apache/crunch/pull/32
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 409616)
Time Spent: 20m  (was: 10m)

> NullPointerException in RegionLocationTable
> ---
>
> Key: CRUNCH-695
> URL: https://issues.apache.org/jira/browse/CRUNCH-695
> Project: Crunch
>  Issue Type: Bug
>Reporter: Andrew Olson
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We saw this exception in processing jobs when Region Servers were abruptly 
> aborting and restarting, causing offline or in-transition regions. While there 
> may be an underlying HBase client bug, the Crunch code should handle this 
> apparent possibility more gracefully.
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.crunch.io.hbase.RegionLocationTable.create(RegionLocationTable.java:63)
>   at 
> org.apache.crunch.io.hbase.HFileUtils.writeToHFilesForIncrementalLoad(HFileUtils.java:515)
> {noformat}
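> For illustration only (names are stand-ins, not the actual Crunch patch), the 
> graceful handling amounts to skipping region locations that are unavailable 
> instead of dereferencing them:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> public final class RegionLocationSketch {
>   /** Minimal stand-in for an HBase region location; hostname may be null
>       while a region is offline or in transition. */
>   static final class RegionLocation {
>     final byte[] startKey;
>     final String hostname;
>     RegionLocation(byte[] startKey, String hostname) {
>       this.startKey = startKey;
>       this.hostname = hostname;
>     }
>   }
>
>   static List<RegionLocation> onlineOnly(List<RegionLocation> locations) {
>     List<RegionLocation> online = new ArrayList<>();
>     for (RegionLocation loc : locations) {
>       if (loc == null || loc.hostname == null) {
>         continue; // skip offline/in-transition regions rather than NPE
>       }
>       online.add(loc);
>     }
>     return online;
>   }
> }
> {code}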



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-695) NullPointerException in RegionLocationTable

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-695?focusedWorklogId=409610&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-409610
 ]

ASF GitHub Bot logged work on CRUNCH-695:
-

Author: ASF GitHub Bot
Created on: 25/Mar/20 16:04
Start Date: 25/Mar/20 16:04
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #32: CRUNCH-695: 
Fix NullPointerException in RegionLocationTable
URL: https://github.com/apache/crunch/pull/32
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 409610)
Remaining Estimate: 0h
Time Spent: 10m

> NullPointerException in RegionLocationTable
> ---
>
> Key: CRUNCH-695
> URL: https://issues.apache.org/jira/browse/CRUNCH-695
> Project: Crunch
>  Issue Type: Bug
>Reporter: Andrew Olson
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We saw this exception in processing jobs when Region Servers were abruptly 
> aborting and restarting, causing offline or in-transition regions. While there 
> may be an underlying HBase client bug, the Crunch code should handle this 
> apparent possibility more gracefully.
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.crunch.io.hbase.RegionLocationTable.create(RegionLocationTable.java:63)
>   at 
> org.apache.crunch.io.hbase.HFileUtils.writeToHFilesForIncrementalLoad(HFileUtils.java:515)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-693) ParseTest fails when building on JDK 11

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-693?focusedWorklogId=371956&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371956
 ]

ASF GitHub Bot logged work on CRUNCH-693:
-

Author: ASF GitHub Bot
Created on: 14/Jan/20 23:08
Start Date: 14/Jan/20 23:08
Worklog Time Spent: 10m 
  Work Description: jwills commented on pull request #31: CRUNCH-693: Make 
text parsing locale-independent
URL: https://github.com/apache/crunch/pull/31
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 371956)
Time Spent: 20m  (was: 10m)

> ParseTest fails when building on JDK 11
> ---
>
> Key: CRUNCH-693
> URL: https://issues.apache.org/jira/browse/CRUNCH-693
> Project: Crunch
>  Issue Type: Bug
>Reporter: Gabriel Reid
>Assignee: Gabriel Reid
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> At least in some locales, {{org.apache.crunch.contrib.text.ParseTest}} fails 
> due to locale-specific parsing of floating-point numbers. This also means that 
> the behavior of {{org.apache.crunch.contrib.text.Parse}} depends on the 
> locale of the JVM where Crunch is running (at least with JDK 11).
> It would probably be better if the behavior was consistent, regardless of 
> the locale of the JVM.
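> A short sketch of the locale-pinning idea (illustrative, not the actual 
> Crunch change): fix the {{Scanner}} locale so floating-point parsing does not 
> depend on the JVM default.
> {code:java}
> import java.util.Locale;
> import java.util.Scanner;
>
> public final class LocaleSafeParse {
>   static float parseFloat(String s) {
>     Scanner scanner = new Scanner(s);
>     scanner.useLocale(Locale.US); // "1.5" parses identically on every JVM locale
>     try {
>       return scanner.nextFloat();
>     } finally {
>       scanner.close();
>     }
>   }
>
>   public static void main(String[] args) {
>     // Without useLocale, a JVM defaulting to e.g. Locale.GERMANY expects "1,5"
>     // and nextFloat() on "1.5" throws InputMismatchException.
>     System.out.println(parseFloat("1.5"));
>   }
> }
> {code}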



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-693) ParseTest fails when building on JDK 11

2020-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-693?focusedWorklogId=370359&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-370359
 ]

ASF GitHub Bot logged work on CRUNCH-693:
-

Author: ASF GitHub Bot
Created on: 11/Jan/20 15:40
Start Date: 11/Jan/20 15:40
Worklog Time Spent: 10m 
  Work Description: gabrielreid commented on pull request #31: CRUNCH-693: 
Make text parsing locale-independent
URL: https://github.com/apache/crunch/pull/31
 
 
   Standardize on US-based locale for number formatting (which is
   backwards-compatible with historical behavior).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 370359)
Remaining Estimate: 0h
Time Spent: 10m

> ParseTest fails when building on JDK 11
> ---
>
> Key: CRUNCH-693
> URL: https://issues.apache.org/jira/browse/CRUNCH-693
> Project: Crunch
>  Issue Type: Bug
>Reporter: Gabriel Reid
>Assignee: Gabriel Reid
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> At least in some locales, {{org.apache.crunch.contrib.text.ParseTest}} fails 
> due to locale-specific parsing of floating-point numbers. This also means that 
> the behavior of {{org.apache.crunch.contrib.text.Parse}} depends on the 
> locale of the JVM where Crunch is running (at least with JDK 11).
> It would probably be better if the behavior was consistent, regardless of 
> the locale of the JVM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (CRUNCH-688) HFile node affinity only works with default namespace HBase tables

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-688?focusedWorklogId=288324&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288324
 ]

ASF GitHub Bot logged work on CRUNCH-688:
-

Author: ASF GitHub Bot
Created on: 02/Aug/19 23:12
Start Date: 02/Aug/19 23:12
Worklog Time Spent: 10m 
  Work Description: jwills commented on issue #27: CRUNCH-688: Fix HFile 
node affinity for non-default namespace HBase t…
URL: https://github.com/apache/crunch/pull/27#issuecomment-517869078
 
 
   amazing that merges Just Work now. Thank you Andrew!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288324)
Time Spent: 40m  (was: 0.5h)

> HFile node affinity only works with default namespace HBase tables
> --
>
> Key: CRUNCH-688
> URL: https://issues.apache.org/jira/browse/CRUNCH-688
> Project: Crunch
>  Issue Type: Bug
>Reporter: Andrew Olson
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> A problem was found with CRUNCH-644, which introduced HFile node affinity, 
> when using a non-default-namespace HBase table:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: regionLocations_myTableNamespace:myTableName
> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
> at org.apache.hadoop.fs.Path.<init>(Path.java:172)
> at org.apache.hadoop.fs.Path.<init>(Path.java:94)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writeToHFilesForIncrementalLoad(HFileUtils.java:517)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:608)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:578)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:542)
> ... 
>  {noformat}
> The ":" delimiter in the qualified table name isn't a valid path element
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/model.html#Paths_and_Path_Elements
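> One plausible workaround shape (illustrative; the actual fix may differ): 
> substitute the namespace delimiter before using the qualified table name as 
> a path element.
> {code:java}
> import org.apache.hadoop.fs.Path;
>
> public final class RegionLocationPathSketch {
>   static Path regionLocationFilePath(Path tempDir, String qualifiedTableName) {
>     // "myTableNamespace:myTableName" is not a valid path element, so swap
>     // the ':' delimiter for a character that is.
>     String safeName = qualifiedTableName.replace(':', '_');
>     return new Path(tempDir, "regionLocations_" + safeName);
>   }
> }
> {code}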



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-688) HFile node affinity only works with default namespace HBase tables

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-688?focusedWorklogId=288323&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288323
 ]

ASF GitHub Bot logged work on CRUNCH-688:
-

Author: ASF GitHub Bot
Created on: 02/Aug/19 23:12
Start Date: 02/Aug/19 23:12
Worklog Time Spent: 10m 
  Work Description: jwills commented on pull request #27: CRUNCH-688: Fix 
HFile node affinity for non-default namespace HBase t…
URL: https://github.com/apache/crunch/pull/27
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288323)
Time Spent: 0.5h  (was: 20m)

> HFile node affinity only works with default namespace HBase tables
> --
>
> Key: CRUNCH-688
> URL: https://issues.apache.org/jira/browse/CRUNCH-688
> Project: Crunch
>  Issue Type: Bug
>Reporter: Andrew Olson
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A problem was found with CRUNCH-644, which introduced HFile node affinity, 
> when using a non-default-namespace HBase table:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: regionLocations_myTableNamespace:myTableName
> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
> at org.apache.hadoop.fs.Path.<init>(Path.java:172)
> at org.apache.hadoop.fs.Path.<init>(Path.java:94)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writeToHFilesForIncrementalLoad(HFileUtils.java:517)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:608)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:578)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:542)
> ... 
>  {noformat}
> The ":" delimiter in the qualified table name isn't a valid path element
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/model.html#Paths_and_Path_Elements



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-688) HFile node affinity only works with default namespace HBase tables

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-688?focusedWorklogId=288295&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288295
 ]

ASF GitHub Bot logged work on CRUNCH-688:
-

Author: ASF GitHub Bot
Created on: 02/Aug/19 21:54
Start Date: 02/Aug/19 21:54
Worklog Time Spent: 10m 
  Work Description: jwills commented on issue #27: CRUNCH-688: Fix HFile 
node affinity for non-default namespace HBase t…
URL: https://github.com/apache/crunch/pull/27#issuecomment-517855010
 
 
   Nice, looking
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288295)
Time Spent: 20m  (was: 10m)

> HFile node affinity only works with default namespace HBase tables
> --
>
> Key: CRUNCH-688
> URL: https://issues.apache.org/jira/browse/CRUNCH-688
> Project: Crunch
>  Issue Type: Bug
>Reporter: Andrew Olson
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A problem was found with CRUNCH-644, which introduced HFile node affinity, 
> when using a non-default-namespace HBase table:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: regionLocations_myTableNamespace:myTableName
> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
> at org.apache.hadoop.fs.Path.<init>(Path.java:172)
> at org.apache.hadoop.fs.Path.<init>(Path.java:94)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writeToHFilesForIncrementalLoad(HFileUtils.java:517)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:608)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:578)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:542)
> ... 
>  {noformat}
> The ":" delimiter in the qualified table name isn't a valid path element
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/model.html#Paths_and_Path_Elements



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-688) HFile node affinity only works with default namespace HBase tables

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-688?focusedWorklogId=288294&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288294
 ]

ASF GitHub Bot logged work on CRUNCH-688:
-

Author: ASF GitHub Bot
Created on: 02/Aug/19 21:48
Start Date: 02/Aug/19 21:48
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #27: CRUNCH-688: 
Fix HFile node affinity for non-default namespace HBase t…
URL: https://github.com/apache/crunch/pull/27
 
 
   …ables
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288294)
Time Spent: 10m
Remaining Estimate: 0h

> HFile node affinity only works with default namespace HBase tables
> --
>
> Key: CRUNCH-688
> URL: https://issues.apache.org/jira/browse/CRUNCH-688
> Project: Crunch
>  Issue Type: Bug
>Reporter: Andrew Olson
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A problem was found with CRUNCH-644, which introduced HFile node affinity, 
> when using a non-default-namespace HBase table:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: regionLocations_myTableNamespace:myTableName
> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
> at org.apache.hadoop.fs.Path.<init>(Path.java:172)
> at org.apache.hadoop.fs.Path.<init>(Path.java:94)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writeToHFilesForIncrementalLoad(HFileUtils.java:517)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:608)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:578)
> at 
> org.apache.crunch.io.hbase.HFileUtils.writePutsToHFilesForIncrementalLoad(HFileUtils.java:542)
> ... 
>  {noformat}
> The ":" delimiter in the qualified table name isn't a valid path element
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/model.html#Paths_and_Path_Elements



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-679) Improvements for usage of DistCp

2019-07-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-679?focusedWorklogId=276834&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276834
 ]

ASF GitHub Bot logged work on CRUNCH-679:
-

Author: ASF GitHub Bot
Created on: 15/Jul/19 16:42
Start Date: 15/Jul/19 16:42
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #20: CRUNCH-679: 
Improvements for usage of DistCp
URL: https://github.com/apache/crunch/pull/20
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 276834)
Time Spent: 1.5h  (was: 1h 20m)

> Improvements for usage of DistCp
> 
>
> Key: CRUNCH-679
> URL: https://issues.apache.org/jira/browse/CRUNCH-679
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and 
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-00000. Currently the 
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile 
> method, and would therefore create destination file names like out0-m-00000, 
> which are problematic when there are multiple map-only jobs writing to the 
> same target path. This can be achieved by providing a custom CopyListing 
> implementation that is capable of dynamically renaming target paths based on 
> a given mapping (see the configuration sketch after this list). Unfortunately 
> a substantial amount of code duplication from the original SimpleCopyListing 
> class is currently required in order to inject the necessary logic for 
> modifying the sequence file entry keys. HADOOP-16147 has been opened to allow 
> it to be simplified in the future.
> * The handleOutputs implementation in HFileTarget is essentially identical to 
> the one in FileTargetImpl that it overrides. We can remove it and just share 
> the same code.
> * It could be useful to add a property for configuring the max DistCp task 
> bandwidth, as the default (100 MB/s per task) may be too high for certain 
> environments.
> * The default of 1000 for max DistCp map tasks may be too high in some 
> situations, resulting in 503 Slow Down errors from S3, especially if there are 
> multiple jobs writing into the same bucket. Reducing it to 100 should help 
> prevent issues along those lines while still providing adequate parallel 
> throughput.
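> As sketched below, the rename mapping is supplied through a single 
> configuration property. The property name and pair format follow the 
> CrunchRenameCopyListing code quoted later in this thread; the file names 
> themselves are illustrative.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public final class DistCpRenameConfigSketch {
>   static Configuration withPathRenames(Configuration conf) {
>     // Comma-separated original-file:renamed-file pairs, exactly the format
>     // parsed (and validated) by CrunchRenameCopyListing's constructor.
>     conf.set("crunch.distcp.path.renames",
>         "out0-m-00000:part-m-00000,out1-m-00000:part-m-00001");
>     return conf;
>   }
> }
> {code}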



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-679) Improvements for usage of DistCp

2019-07-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-679?focusedWorklogId=276781&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276781
 ]

ASF GitHub Bot logged work on CRUNCH-679:
-

Author: ASF GitHub Bot
Created on: 15/Jul/19 15:29
Start Date: 15/Jul/19 15:29
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #20: CRUNCH-679: 
Improvements for usage of DistCp
URL: https://github.com/apache/crunch/pull/20#discussion_r303498022
 
 

 ##
 File path: crunch-core/src/main/java/org/apache/crunch/util/CrunchRenameCopyListing.java
 ##

+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the
+ * Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS"
+ * BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+package org.apache.crunch.util;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.security.Credentials;
+import org.apache.hadoop.tools.CopyListing;
+import org.apache.hadoop.tools.CopyListingFileStatus;
+import org.apache.hadoop.tools.DistCpOptions;
+import org.apache.hadoop.tools.DistCpOptions.FileAttribute;
+import org.apache.hadoop.tools.SimpleCopyListing;
+import org.apache.hadoop.tools.util.DistCpUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Stack;
+
+/**
+ * A custom {@link CopyListing} implementation capable of dynamically renaming
+ * the target paths according to a {@link #DISTCP_PATH_RENAMES configured set of values}.
+ *
+ * Once https://issues.apache.org/jira/browse/HADOOP-16147 is available, this
+ * class can be significantly simplified.
+ */
+public class CrunchRenameCopyListing extends SimpleCopyListing {
+  /**
+   * Comma-separated list of original-file:renamed-file path rename pairs.
+   */
+  public static final String DISTCP_PATH_RENAMES = "crunch.distcp.path.renames";
+
+  private static final Logger LOG = LoggerFactory.getLogger(CrunchRenameCopyListing.class);
+  private final Map<String, String> pathRenames;
+
+  private long totalPaths = 0;
+  private long totalBytesToCopy = 0;
+
+  /**
+   * Constructor, to initialize configuration.
+   *
+   * @param configuration The input configuration, with which the source/target FileSystems may be accessed.
+   * @param credentials - Credentials object on which the FS delegation tokens are cached. If null
+   * delegation token caching is skipped
+   */
+  public CrunchRenameCopyListing(Configuration configuration, Credentials credentials) {
+    super(configuration, credentials);
+
+    pathRenames = new HashMap<>();
+
+    String[] pathRenameConf = configuration.getStrings(DISTCP_PATH_RENAMES);
+    if (pathRenameConf == null) {
+      throw new IllegalArgumentException("Missing required configuration: " + DISTCP_PATH_RENAMES);
+    }
+    for (String pathRename : pathRenameConf) {
+      String[] pathRenameParts = pathRename.split(":");
+      if (pathRenameParts.length != 2) {
+        throw new IllegalArgumentException("Invalid path rename format: " + pathRename);
+      }
+      if (pathRenames.put(pathRenameParts[0], pathRenameParts[1]) != null) {
+        throw new IllegalArgumentException("Invalid duplicate path rename: " + pathRenameParts[0]);
+      }
+    }
+    LOG.info("Loaded {} path rename entries", pathRenames.size());
+
+    // Clear out the rename configuration property, as it is no longer needed
+    configuration.unset(DISTCP_PATH_RENAMES);
+  }
+
+  @Override
+  public void doBuildListing(SequenceFile.Writer fileListWriter, DistCpOptions options) throws IOException {
+    try {
+      for (Path path : options.getSourcePaths()) {
+        FileSystem sourceFS = path.getFileSystem(getConf());
+        final boolean preserveAcls = options.shouldPreserve(FileAttribute.ACL);
+        final boolean preserveXAttrs = options.shouldPreserve(FileAttribute.XATTR);
+        final boolean preserveRawXAttrs = options.shouldPreserveRawXattrs();
+        path = makeQualified(path);
+
+        FileStatus 

[jira] [Work logged] (CRUNCH-679) Improvements for usage of DistCp

2019-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-679?focusedWorklogId=276164&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276164
 ]

ASF GitHub Bot logged work on CRUNCH-679:
-

Author: ASF GitHub Bot
Created on: 12/Jul/19 21:48
Start Date: 12/Jul/19 21:48
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #20: CRUNCH-679: 
Improvements for usage of DistCp
URL: https://github.com/apache/crunch/pull/20#discussion_r303159612
 
 

 ##
 File path: crunch-core/src/main/java/org/apache/crunch/util/CrunchRenameCopyListing.java
 ##

+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the
+ * Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS"
+ * BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+package org.apache.crunch.util;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.security.Credentials;
+import org.apache.hadoop.tools.CopyListing;
+import org.apache.hadoop.tools.CopyListingFileStatus;
+import org.apache.hadoop.tools.DistCpOptions;
+import org.apache.hadoop.tools.DistCpOptions.FileAttribute;
+import org.apache.hadoop.tools.SimpleCopyListing;
+import org.apache.hadoop.tools.util.DistCpUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Stack;
+
+/**
+ * A custom {@link CopyListing} implementation capable of dynamically renaming
+ * the target paths according to a {@link #DISTCP_PATH_RENAMES configured set of values}.
+ *
+ * Once https://issues.apache.org/jira/browse/HADOOP-16147 is available, this
+ * class can be significantly simplified.
+ */
+public class CrunchRenameCopyListing extends SimpleCopyListing {
+  /**
+   * Comma-separated list of original-file:renamed-file path rename pairs.
+   */
+  public static final String DISTCP_PATH_RENAMES = "crunch.distcp.path.renames";
+
+  private static final Logger LOG = LoggerFactory.getLogger(CrunchRenameCopyListing.class);
+  private final Map<String, String> pathRenames;
+
+  private long totalPaths = 0;
+  private long totalBytesToCopy = 0;
+
+  /**
+   * Constructor, to initialize configuration.
+   *
+   * @param configuration The input configuration, with which the source/target FileSystems may be accessed.
+   * @param credentials - Credentials object on which the FS delegation tokens are cached. If null
+   * delegation token caching is skipped
+   */
+  public CrunchRenameCopyListing(Configuration configuration, Credentials credentials) {
+    super(configuration, credentials);
+
+    pathRenames = new HashMap<>();
+
+    String[] pathRenameConf = configuration.getStrings(DISTCP_PATH_RENAMES);
+    if (pathRenameConf == null) {
+      throw new IllegalArgumentException("Missing required configuration: " + DISTCP_PATH_RENAMES);
+    }
+    for (String pathRename : pathRenameConf) {
+      String[] pathRenameParts = pathRename.split(":");
+      if (pathRenameParts.length != 2) {
+        throw new IllegalArgumentException("Invalid path rename format: " + pathRename);
+      }
+      if (pathRenames.put(pathRenameParts[0], pathRenameParts[1]) != null) {
+        throw new IllegalArgumentException("Invalid duplicate path rename: " + pathRenameParts[0]);
+      }
+    }
+    LOG.info("Loaded {} path rename entries", pathRenames.size());
+
+    // Clear out the rename configuration property, as it is no longer needed
+    configuration.unset(DISTCP_PATH_RENAMES);
+  }
+
+  @Override
+  public void doBuildListing(SequenceFile.Writer fileListWriter, DistCpOptions options) throws IOException {
+    try {
+      for (Path path : options.getSourcePaths()) {
+        FileSystem sourceFS = path.getFileSystem(getConf());
+        final boolean preserveAcls = options.shouldPreserve(FileAttribute.ACL);
+        final boolean preserveXAttrs = options.shouldPreserve(FileAttribute.XATTR);
+        final boolean preserveRawXAttrs = options.shouldPreserveRawXattrs();
+        path = makeQualified(path);
+
+        FileStatus 

[jira] [Work logged] (CRUNCH-681) HFileUtils.writeToHFilesForIncrementalLoad() should accept a FileSystem parameter

2019-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-681?focusedWorklogId=276163&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276163
 ]

ASF GitHub Bot logged work on CRUNCH-681:
-

Author: ASF GitHub Bot
Created on: 12/Jul/19 21:43
Start Date: 12/Jul/19 21:43
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #22: CRUNCH-681: 
Updating HFileUtils to accept a filesystem parameter for …
URL: https://github.com/apache/crunch/pull/22
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 276163)
Time Spent: 1h  (was: 50m)

> HFileUtils.writeToHFilesForIncrementalLoad() should accept a FileSystem 
> parameter
> --
>
> Key: CRUNCH-681
> URL: https://issues.apache.org/jira/browse/CRUNCH-681
> Project: Crunch
>  Issue Type: Improvement
>Affects Versions: 0.14.0
>Reporter: Ben Roling
>Priority: Major
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> With CRUNCH-677 in place, HFileUtils.writeToHFilesForIncrementalLoad() should 
> have a form that accepts a FileSystem and propagates that FileSystem to the 
> HFileTarget. This enables writing HFiles to FileSystems not included in the 
> Configuration of the Pipeline itself.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-685) Limit Target#fileSystem(FileSystem) to only apply filesystem specific configurations to the FormatBundle

2019-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-685?focusedWorklogId=276161&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276161
 ]

ASF GitHub Bot logged work on CRUNCH-685:
-

Author: ASF GitHub Bot
Created on: 12/Jul/19 21:36
Start Date: 12/Jul/19 21:36
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #25: CRUNCH-685 
Use whitelist and blacklist for .fileSystem() properties
URL: https://github.com/apache/crunch/pull/25
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 276161)
Time Spent: 20m  (was: 10m)

> Limit Target#fileSystem(FileSystem) to only apply filesystem specific 
> configurations to the FormatBundle
> 
>
> Key: CRUNCH-685
> URL: https://issues.apache.org/jira/browse/CRUNCH-685
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Nathan Schile
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I have an application that runs multiple Crunch pipelines. The first pipeline 
> (P1) reads data from HDFS and completes successfully. The second pipeline 
> (P2) writes data to the same HDFS that was used in the P1 pipeline. The 
> Target configuration for the P2 pipeline is configured by utilizing the 
> Target#fileSystem(FileSystem) method. The P2 pipeline fails when committing 
> the job [1]. It fails when attempting to read a temporary directory from the 
> P1 pipeline, which was already deleted when the P1 pipeline completed.
> The failure occurs because Hadoop's FileSystem uses an internal cache [2] for 
> FileSystem instances. The first pipeline creates a FileSystem object that 
> contains the configuration 
> "mapreduce.output.fileoutputformat.outputdir":"hdfs://my-cluster/tmp/crunch-897836570/p2/output".
> When the P2 pipeline runs, it invokes Target#fileSystem(FileSystem), which 
> uses the cached FileSystem from the P1 pipeline. The 
> Target#fileSystem(FileSystem) method copies the configuration from the 
> filesystem to the FormatBundle, which causes the erroneous 
> "mapreduce.output.fileoutputformat.outputdir" to be set.
> [1]
> {noformat}
> java.io.FileNotFoundException: File 
> hdfs://my-cluster/tmp/crunch-897836570/p2/output/_temporary/1 does not exist.
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:747)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:113)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:808)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:804)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:804)
>   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
>   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1609)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:322)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:392)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:365)
>   at 
> org.apache.crunch.io.CrunchOutputs$CompositeOutputCommitter.commitJob(CrunchOutputs.java:379)
>   at 
> org.apache.crunch.io.CrunchOutputs$CompositeOutputCommitter.commitJob(CrunchOutputs.java:379)
>   at 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:285)
>   at 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> [2] 
> http://johnjianfang.blogspot.com/2015/03/hadoop-filesystem-internal-cache.html
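> A small sketch of the cache behavior at the root of this (illustrative): the 
> FileSystem cache key ignores the Configuration, so a second lookup returns 
> the first pipeline's instance, stale properties and all.
> {code:java}
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> public final class FsCacheSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration confP1 = new Configuration();
>     confP1.set("mapreduce.output.fileoutputformat.outputdir",
>         "hdfs://my-cluster/tmp/crunch-897836570/p2/output");
>
>     // P1 creates the FileSystem; it is cached by scheme/authority/UGI only.
>     FileSystem fs1 = FileSystem.get(URI.create("hdfs://my-cluster"), confP1);
>
>     // P2 gets the cached instance back -- along with P1's Configuration,
>     // including the stale outputdir that later breaks the job commit.
>     FileSystem fs2 = FileSystem.get(URI.create("hdfs://my-cluster"), new Configuration());
>     System.out.println(fs1 == fs2); // true
>   }
> }
> {code}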



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-683) Avoid unnecessary listStatus calls from getSize computation

2019-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-683?focusedWorklogId=276160&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276160
 ]

ASF GitHub Bot logged work on CRUNCH-683:
-

Author: ASF GitHub Bot
Created on: 12/Jul/19 21:30
Start Date: 12/Jul/19 21:30
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #26: CRUNCH-683 
avoid unnecessary listStatus() calls from getPathSize()
URL: https://github.com/apache/crunch/pull/26
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 276160)
Time Spent: 1h 10m  (was: 1h)

> Avoid unnecessary listStatus calls from getSize computation
> ---
>
> Key: CRUNCH-683
> URL: https://issues.apache.org/jira/browse/CRUNCH-683
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 0.14.0
>Reporter: Jon Hemphill
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The getPathSize computation in SourceTargetHelper currently makes unnecessary 
> listStatus calls when recursing over a directory, which can cause performance 
> issues when the filesystem is an object store such as S3. The performance can 
> be improved by adding a private method for the getPathSize recursion that 
> takes a known FileStatus object as a parameter.
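> A hedged sketch of the described optimization (names illustrative, not the 
> actual SourceTargetHelper code): list a directory once and recurse on the 
> FileStatus objects the listing already returned.
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public final class PathSizeSketch {
>   static long getPathSize(FileSystem fs, Path path) throws IOException {
>     return getPathSize(fs, fs.getFileStatus(path)); // single lookup at the root
>   }
>
>   private static long getPathSize(FileSystem fs, FileStatus status) throws IOException {
>     if (!status.isDirectory()) {
>       return status.getLen();
>     }
>     long size = 0;
>     for (FileStatus child : fs.listStatus(status.getPath())) {
>       size += getPathSize(fs, child); // reuse the child's FileStatus; no extra listStatus
>     }
>     return size;
>   }
> }
> {code}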



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-683) Avoid unnecessary listStatus calls from getSize computation

2019-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-683?focusedWorklogId=276158&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276158
 ]

ASF GitHub Bot logged work on CRUNCH-683:
-

Author: ASF GitHub Bot
Created on: 12/Jul/19 21:28
Start Date: 12/Jul/19 21:28
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on issue #23: CRUNCH-683 Avoid 
unnecessary listStatus calls from getPathSize computation
URL: https://github.com/apache/crunch/pull/23#issuecomment-511039002
 
 
   #26 supersedes these changes.  Closing this PR.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 276158)
Time Spent: 50m  (was: 40m)

> Avoid unnecessary listStatus calls from getSize computation
> ---
>
> Key: CRUNCH-683
> URL: https://issues.apache.org/jira/browse/CRUNCH-683
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 0.14.0
>Reporter: Jon Hemphill
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The getPathSize computation in SourceTargetHelper currently makes unnecessary 
> listStatus calls when recursing over a directory, which can cause performance 
> issues when the filesystem is an object store such as S3. The performance can 
> be improved by adding a private method for the getPathSize recursion that 
> takes a known FileStatus object as a parameter.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-683) Avoid unnecessary listStatus calls from getSize computation

2019-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-683?focusedWorklogId=276159&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-276159
 ]

ASF GitHub Bot logged work on CRUNCH-683:
-

Author: ASF GitHub Bot
Created on: 12/Jul/19 21:28
Start Date: 12/Jul/19 21:28
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #23: CRUNCH-683 
Avoid unnecessary listStatus calls from getPathSize computation
URL: https://github.com/apache/crunch/pull/23
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 276159)
Time Spent: 1h  (was: 50m)

> Avoid unnecessary listStatus calls from getSize computation
> ---
>
> Key: CRUNCH-683
> URL: https://issues.apache.org/jira/browse/CRUNCH-683
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 0.14.0
>Reporter: Jon Hemphill
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The getPathSize computation in SourceTargetHelper currently makes unnecessary 
> listStatus calls when recursing over a directory, which can cause performance 
> issues when the filesystem is an object store such as S3. The performance can 
> be improved with the addition of a private method to use for the getPathSize 
> recursion that takes a known FileStatus object as a parameter.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (CRUNCH-683) Avoid unnecessary listStatus calls from getSize computation

2019-05-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-683?focusedWorklogId=246241=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-246241
 ]

ASF GitHub Bot logged work on CRUNCH-683:
-

Author: ASF GitHub Bot
Created on: 21/May/19 17:31
Start Date: 21/May/19 17:31
Worklog Time Spent: 10m 
  Work Description: ben-roling commented on pull request #26: CRUNCH-683 
avoid unnecessary listStatus() calls from getPathSize()
URL: https://github.com/apache/crunch/pull/26
 
 
   Since @jonhemphill has been tied up in other things and hasn't been able to 
complete #23, I'm throwing this out to try to wrap it up.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 246241)
Time Spent: 40m  (was: 0.5h)

> Avoid unnecessary listStatus calls from getSize computation
> ---
>
> Key: CRUNCH-683
> URL: https://issues.apache.org/jira/browse/CRUNCH-683
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 0.14.0
>Reporter: Jon Hemphill
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The getPathSize computation in SourceTargetHelper currently makes unnecessary 
> listStatus calls when recursing over a directory, which can cause performance 
> issues when the filesystem is an object store such as S3. The performance can 
> be improved with the addition of a private method to use for the getPathSize 
> recursion that takes a known FileStatus object as a parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-685) Limit Target#fileSystem(FileSystem) to only apply filesystem specific configurations to the FormatBundle

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-685?focusedWorklogId=241888=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241888
 ]

ASF GitHub Bot logged work on CRUNCH-685:
-

Author: ASF GitHub Bot
Created on: 14/May/19 17:26
Start Date: 14/May/19 17:26
Worklog Time Spent: 10m 
  Work Description: ben-roling commented on pull request #25: CRUNCH-685 
Use whitelist and blacklist for .fileSystem() properties
URL: https://github.com/apache/crunch/pull/25
 
 
   Addresses [CRUNCH-685](https://issues.apache.org/jira/browse/CRUNCH-685) by 
introducing whitelist and blacklist configs that control which FileSystem 
config properties are merged into the FormatBundle.
   
   Also adds logging to make it clearer what is going on.
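
One plausible shape of that filtering, as a hedged sketch (the config key names and regex semantics here are assumptions for illustration, not the actual Crunch properties):

```java
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;

public final class FsConfFilterSketch {
  // Hypothetical property names, for illustration only.
  static final String WHITELIST_KEY = "crunch.filesystem.conf.whitelist";
  static final String BLACKLIST_KEY = "crunch.filesystem.conf.blacklist";

  /** Merges fs conf entries into the bundle conf, subject to both lists. */
  static void mergeFiltered(Configuration fsConf, Configuration bundleConf) {
    Pattern whitelist = Pattern.compile(bundleConf.get(WHITELIST_KEY, ".*"));
    // Defaulting the blacklist to job-output keys would block the stale
    // "mapreduce.output.fileoutputformat.outputdir" seen in CRUNCH-685.
    Pattern blacklist = Pattern.compile(
        bundleConf.get(BLACKLIST_KEY, "mapreduce\\.output\\..*"));
    for (Map.Entry<String, String> e : fsConf) {
      String key = e.getKey();
      if (whitelist.matcher(key).matches() && !blacklist.matcher(key).matches()) {
        bundleConf.set(key, e.getValue());
      }
    }
  }
}
```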
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 241888)
Time Spent: 10m
Remaining Estimate: 0h

> Limit Target#fileSystem(FileSystem) to only apply filesystem specific 
> configurations to the FormatBundle
> 
>
> Key: CRUNCH-685
> URL: https://issues.apache.org/jira/browse/CRUNCH-685
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Nathan Schile
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have an application that runs multiple Crunch pipelines. The first pipeline 
> (P1) reads data from HDFS and completes successfully. The second pipeline 
> (P2) writes data to the same HDFS that was used in the P1 pipeline. The 
> Target for the P2 pipeline is configured via the 
> Target#fileSystem(FileSystem) method. The P2 pipeline fails when committing 
> the job [1], attempting to read a temporary directory from the P1 pipeline 
> that was already deleted when the P1 pipeline completed.
> The failure occurs because the Hadoop FileSystem class uses an internal cache 
> [2] for FileSystem instances. The first pipeline creates a FileSystem object 
> that contains the configuration 
> "mapreduce.output.fileoutputformat.outputdir":"hdfs://my-cluster/tmp/crunch-897836570/p2/output".
> When the P2 pipeline runs, it invokes Target#fileSystem(FileSystem), which 
> uses the cached FileSystem from the P1 pipeline. The 
> Target#fileSystem(FileSystem) method copies the configuration from the 
> filesystem to the FormatBundle, which causes the erroneous 
> "mapreduce.output.fileoutputformat.outputdir" to be set.
> [1]
> {noformat}
> java.io.FileNotFoundException: File 
> hdfs://my-cluster/tmp/crunch-897836570/p2/output/_temporary/1 does not exist.
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:747)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:113)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:808)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:804)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:804)
>   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
>   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1609)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:322)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:392)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:365)
>   at 
> org.apache.crunch.io.CrunchOutputs$CompositeOutputCommitter.commitJob(CrunchOutputs.java:379)
>   at 
> org.apache.crunch.io.CrunchOutputs$CompositeOutputCommitter.commitJob(CrunchOutputs.java:379)
>   at 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:285)
>   at 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

[jira] [Work logged] (CRUNCH-684) [crunch-hbase] HbaseTarget getting ignored even if configuration is different

2019-05-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-684?focusedWorklogId=236523=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-236523
 ]

ASF GitHub Bot logged work on CRUNCH-684:
-

Author: ASF GitHub Bot
Created on: 02/May/19 19:54
Start Date: 02/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on issue #24: CRUNCH-684: Fix 
.equals and .hashCode for Targets
URL: https://github.com/apache/crunch/pull/24#issuecomment-488809416
 
 
   This was committed to master, closing
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 236523)
Time Spent: 20m  (was: 10m)

> [crunch-hbase] HbaseTarget getting ignored even if configuration is different
> -
>
> Key: CRUNCH-684
> URL: https://issues.apache.org/jira/browse/CRUNCH-684
> Project: Crunch
>  Issue Type: Improvement
>Reporter: Keerthi Yanda
>Assignee: Josh Wills
>Priority: Minor
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *Current Scenario*
> * We are trying to perform put operations on tables that have the same name 
> but live on different clusters. Below is the code that we are using to 
> perform the write operation:
> {code:java}
> pipeline.write(PCollection, HbaseTarget, WriteMode.APPEND)
> {code}
> * The Pipeline adds this target instance to the "appendedTargets" and 
> "outputTargets" collections (which are HashSets)
> *Issue:*
> * As HbaseTarget's hashCode() and equals() methods check only the tableName, 
> an HbaseTarget with different configuration properties is silently ignored 
> when added to appendedTargets/outputTargets.
> *Proposal*
> * Should we consider both the tableName and the "hbase.zookeeper.quorum" 
> property from "extraConf" to determine whether the table (or HbaseTarget) is 
> unique?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-684) [crunch-hbase] HbaseTarget getting ignored even if configuration is different

2019-05-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-684?focusedWorklogId=236524=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-236524
 ]

ASF GitHub Bot logged work on CRUNCH-684:
-

Author: ASF GitHub Bot
Created on: 02/May/19 19:54
Start Date: 02/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #24: CRUNCH-684: 
Fix .equals and .hashCode for Targets
URL: https://github.com/apache/crunch/pull/24
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 236524)
Time Spent: 0.5h  (was: 20m)

> [crunch-hbase] HbaseTarget getting ignored even if configuration is different
> -
>
> Key: CRUNCH-684
> URL: https://issues.apache.org/jira/browse/CRUNCH-684
> Project: Crunch
>  Issue Type: Improvement
>Reporter: Keerthi Yanda
>Assignee: Josh Wills
>Priority: Minor
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> *Current Scenario*
> * We are trying to perform put operations on tables that have the same name 
> but live on different clusters. Below is the code that we are using to 
> perform the write operation:
> {code:java}
> pipeline.write(PCollection, HbaseTarget, WriteMode.APPEND)
> {code}
> * The Pipeline adds this target instance to the "appendedTargets" and 
> "outputTargets" collections (which are HashSets)
> *Issue:*
> * As HbaseTarget's hashCode() and equals() methods check only the tableName, 
> an HbaseTarget with different configuration properties is silently ignored 
> when added to appendedTargets/outputTargets.
> *Proposal*
> * Should we consider both the tableName and the "hbase.zookeeper.quorum" 
> property from "extraConf" to determine whether the table (or HbaseTarget) is 
> unique?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-684) [crunch-hbase] HbaseTarget getting ignored even if configuration is different

2019-05-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-684?focusedWorklogId=236002=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-236002
 ]

ASF GitHub Bot logged work on CRUNCH-684:
-

Author: ASF GitHub Bot
Created on: 01/May/19 21:21
Start Date: 01/May/19 21:21
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #24: CRUNCH-684: 
Fix .equals and .hashCode for Targets
URL: https://github.com/apache/crunch/pull/24
 
 
   Previously the `equals` and `hashCode` methods for the `Target` 
implementations Crunch provides did not consider all available information when 
determining uniqueness. `FileTargetImpl` only used the path, and `HBaseTarget` 
only the table name. This could result in situations where a target was 
silently ignored because of how a `Set` is used in various places for holding a 
pipeline's collection of targets. For HBase especially, the 
`hbase.zookeeper.quorum` configuration, if supplied, can change where the 
table actually resides.
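
A simplified sketch of that fix (a stand-in class, not the real HBaseTarget, which carries more state):

```java
import java.util.Map;
import java.util.Objects;

// Equality now considers the extra configuration (e.g.
// hbase.zookeeper.quorum) in addition to the table name, so two targets
// for same-named tables on different clusters no longer collapse into one
// entry of the pipeline's target Set.
class HBaseTargetSketch {
  private final String tableName;
  private final Map<String, String> extraConf;

  HBaseTargetSketch(String tableName, Map<String, String> extraConf) {
    this.tableName = tableName;
    this.extraConf = extraConf;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof HBaseTargetSketch)) {
      return false;
    }
    HBaseTargetSketch that = (HBaseTargetSketch) other;
    return tableName.equals(that.tableName) && extraConf.equals(that.extraConf);
  }

  @Override
  public int hashCode() {
    return Objects.hash(tableName, extraConf);
  }
}
```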
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 236002)
Time Spent: 10m
Remaining Estimate: 0h

> [crunch-hbase] HbaseTarget getting ignored even if configuration is different
> -
>
> Key: CRUNCH-684
> URL: https://issues.apache.org/jira/browse/CRUNCH-684
> Project: Crunch
>  Issue Type: Improvement
>Reporter: Keerthi Yanda
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Current Scenario*
> * We are trying to perform put operations on tables that have the same name 
> but live on different clusters. Below is the code that we are using to 
> perform the write operation:
> {code:java}
> pipeline.write(PCollection, HbaseTarget, WriteMode.APPEND)
> {code}
> * The Pipeline adds this target instance to the "appendedTargets" and 
> "outputTargets" collections (which are HashSets)
> *Issue:*
> * As HbaseTarget's hashCode() and equals() methods check only the tableName, 
> an HbaseTarget with different configuration properties is silently ignored 
> when added to appendedTargets/outputTargets.
> *Proposal*
> * Should we consider both the tableName and the "hbase.zookeeper.quorum" 
> property from "extraConf" to determine whether the table (or HbaseTarget) is 
> unique?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-683) Avoid unnecessary listStatus calls from getSize computation

2019-04-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-683?focusedWorklogId=232852=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-232852
 ]

ASF GitHub Bot logged work on CRUNCH-683:
-

Author: ASF GitHub Bot
Created on: 25/Apr/19 13:21
Start Date: 25/Apr/19 13:21
Worklog Time Spent: 10m 
  Work Description: jonhemphill commented on pull request #23: CRUNCH-683 
Avoid unnecessary listStatus calls from getPathSize computation
URL: https://github.com/apache/crunch/pull/23#discussion_r278546913
 
 

 ##
 File path: 
crunch-core/src/main/java/org/apache/crunch/io/SourceTargetHelper.java
 ##
 @@ -41,17 +41,23 @@ public static long getPathSize(FileSystem fs, Path path) 
throws IOException {
 }
 long size = 0;
 for (FileStatus status : stati) {
-  if (status.isDir()) {
-for (FileStatus st : fs.listStatus(status.getPath())) {
-  size += getPathSize(fs, st.getPath());
-}
-  } else {
-size += status.getLen();
-  }
+  size += getPathSize(fs, status);
 }
 return size;
   }
-  
+
+  private static long getPathSize(final FileSystem fs, final FileStatus 
status) throws IOException {
 
 Review comment:
   Thanks for the feedback @steveloughran. I have it updated locally to use 
`listFiles` as you suggest and it is working well; I just need to update the 
tests before committing.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 232852)
Time Spent: 0.5h  (was: 20m)

> Avoid unnecessary listStatus calls from getSize computation
> ---
>
> Key: CRUNCH-683
> URL: https://issues.apache.org/jira/browse/CRUNCH-683
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 0.14.0
>Reporter: Jon Hemphill
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The getPathSize computation in SourceTargetHelper currently makes unnecessary 
> listStatus calls when recursing over a directory, which can cause performance 
> issues when the filesystem is an object store such as S3. The performance can 
> be improved with the addition of a private method to use for the getPathSize 
> recursion that takes a known FileStatus object as a parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-683) Avoid unnecessary listStatus calls from getSize computation

2019-04-24 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-683?focusedWorklogId=231949=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-231949
 ]

ASF GitHub Bot logged work on CRUNCH-683:
-

Author: ASF GitHub Bot
Created on: 24/Apr/19 09:47
Start Date: 24/Apr/19 09:47
Worklog Time Spent: 10m 
  Work Description: steveloughran commented on pull request #23: CRUNCH-683 
Avoid unnecessary listStatus calls from getPathSize computation
URL: https://github.com/apache/crunch/pull/23#discussion_r278042702
 
 

 ##
 File path: 
crunch-core/src/main/java/org/apache/crunch/io/SourceTargetHelper.java
 ##
 @@ -41,17 +41,23 @@ public static long getPathSize(FileSystem fs, Path path) 
throws IOException {
 }
 long size = 0;
 for (FileStatus status : stati) {
-  if (status.isDir()) {
-for (FileStatus st : fs.listStatus(status.getPath())) {
-  size += getPathSize(fs, st.getPath());
-}
-  } else {
-size += status.getLen();
-  }
+  size += getPathSize(fs, status);
 }
 return size;
   }
-  
+
+  private static long getPathSize(final FileSystem fs, final FileStatus 
status) throws IOException {
 
 Review comment:
   This is still doing a recursive treewalk. If you call 
`FileSystem.listFiles(path, true)` you get a deep listing of files only from 
the store. For an object store with an optimised implementation (as s3a has), 
only one HTTP request is made per 1000 objects, irrespective of the depth of 
the tree. For HDFS there are still efficiencies, especially when you have a 
directory containing many millions of files: with listStatus() all the results 
have to be serialized and marshalled over as one, rather than paged over to 
the client.
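
For reference, a minimal sketch of the suggested approach using the standard Hadoop FileSystem API:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// One recursive listing walks the whole tree; results are paged to the
// client, and s3a serves roughly 1000 objects per HTTP request.
static long getPathSize(FileSystem fs, Path path) throws IOException {
  long size = 0;
  RemoteIterator<LocatedFileStatus> files = fs.listFiles(path, true);
  while (files.hasNext()) {
    size += files.next().getLen();
  }
  return size;
}
```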
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 231949)
Time Spent: 20m  (was: 10m)

> Avoid unnecessary listStatus calls from getSize computation
> ---
>
> Key: CRUNCH-683
> URL: https://issues.apache.org/jira/browse/CRUNCH-683
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 0.14.0
>Reporter: Jon Hemphill
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The getPathSize computation in SourceTargetHelper currently makes unnecessary 
> listStatus calls when recursing over a directory, which can cause performance 
> issues when the filesystem is an object store such as S3. The performance can 
> be improved with the addition of a private method to use for the getPathSize 
> recursion that takes a known FileStatus object as a parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-683) Avoid unnecessary listStatus calls from getSize computation

2019-04-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-683?focusedWorklogId=230813=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-230813
 ]

ASF GitHub Bot logged work on CRUNCH-683:
-

Author: ASF GitHub Bot
Created on: 22/Apr/19 18:09
Start Date: 22/Apr/19 18:09
Worklog Time Spent: 10m 
  Work Description: jonhemphill commented on pull request #23: CRUNCH-683 
Avoid unnecessary listStatus calls from getPathSize computation
URL: https://github.com/apache/crunch/pull/23
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 230813)
Time Spent: 10m
Remaining Estimate: 0h

> Avoid unnecessary listStatus calls from getSize computation
> ---
>
> Key: CRUNCH-683
> URL: https://issues.apache.org/jira/browse/CRUNCH-683
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 0.14.0
>Reporter: Jon Hemphill
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The getPathSize computation in SourceTargetHelper currently makes unnecessary 
> listStatus calls when recursing over a directory, which can cause performance 
> issues when the filesystem is an object store such as S3. The performance can 
> be improved with the addition of a private method to use for the getPathSize 
> recursion that takes a known FileStatus object as a parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-681) HFileUtils. writeToHFilesForIncrementalLoad() should accept a FileSystem parameter

2019-04-18 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-681?focusedWorklogId=229912=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-229912
 ]

ASF GitHub Bot logged work on CRUNCH-681:
-

Author: ASF GitHub Bot
Created on: 18/Apr/19 20:56
Start Date: 18/Apr/19 20:56
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #22: CRUNCH-681: 
Updating HFileUtils to accept a filesystem parameter for …
URL: https://github.com/apache/crunch/pull/22#discussion_r276830917
 
 

 ##
 File path: 
crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HFileUtils.java
 ##
 @@ -384,6 +451,24 @@ public void process(Pair> input, 
Emitter emitter
 writeToHFilesForIncrementalLoad(cells, connection, tableName, outputPath, 
false);
   }
 
+  public static <C extends Cell> void writeToHFilesForIncrementalLoad(
 
 Review comment:
   added, 
https://github.com/noslowerdna/crunch/commit/9dd88da343727acd4685e220f499f4fc21770dfe
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 229912)
Time Spent: 40m  (was: 0.5h)

> HFileUtils. writeToHFilesForIncrementalLoad() should accept a FileSystem 
> parameter
> --
>
> Key: CRUNCH-681
> URL: https://issues.apache.org/jira/browse/CRUNCH-681
> Project: Crunch
>  Issue Type: Improvement
>Affects Versions: 0.14.0
>Reporter: Ben Roling
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> With CRUNCH-677 in, HFileUtils.writeToHFilesForIncrementalLoad() should have 
> a form that accepts a FileSystem and propagates that FileSystem to the 
> HFileTarget.  This enables writing HFiles to FileSystems not included in the 
> Configuration of the Pipeline itself.
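
In the spirit of the CRUNCH-677 example, usage might look like the following sketch (the trailing FileSystem parameter is an assumption based on the PR diff above; the other arguments mirror the existing call site):

{code:java}
FileSystem hfileFs = ...; // FileSystem of the cluster that will receive the HFiles
HFileUtils.writeToHFilesForIncrementalLoad(cells, connection, tableName, outputPath, false, hfileFs);
{code}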



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-681) HFileUtils. writeToHFilesForIncrementalLoad() should accept a FileSystem parameter

2019-04-18 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-681?focusedWorklogId=229749=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-229749
 ]

ASF GitHub Bot logged work on CRUNCH-681:
-

Author: ASF GitHub Bot
Created on: 18/Apr/19 15:31
Start Date: 18/Apr/19 15:31
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #22: CRUNCH-681: 
Updating HFileUtils to accept a filesystem parameter for …
URL: https://github.com/apache/crunch/pull/22
 
 
   …targets and sources
   
   Follow-up change that we missed in 
[CRUNCH-677](https://issues.apache.org/jira/browse/CRUNCH-677) - the creation 
of HFileSource and HFileTarget instances in HFileUtils should support usage of 
a supplied FileSystem instance.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 229749)
Time Spent: 10m
Remaining Estimate: 0h

> HFileUtils. writeToHFilesForIncrementalLoad() should accept a FileSystem 
> parameter
> --
>
> Key: CRUNCH-681
> URL: https://issues.apache.org/jira/browse/CRUNCH-681
> Project: Crunch
>  Issue Type: Improvement
>Affects Versions: 0.14.0
>Reporter: Ben Roling
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With CRUNCH-677 in, HFileUtils.writeToHFilesForIncrementalLoad() should have 
> a form that accepts a FileSystem and propagates that FileSystem to the 
> HFileTarget.  This enables writing HFiles to FileSystems not included in the 
> Configuration of the Pipeline itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-679) Improvements for usage of DistCp

2019-03-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-679?focusedWorklogId=209674=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-209674
 ]

ASF GitHub Bot logged work on CRUNCH-679:
-

Author: ASF GitHub Bot
Created on: 07/Mar/19 17:19
Start Date: 07/Mar/19 17:19
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #20: CRUNCH-679: 
Improvements for usage of DistCp
URL: https://github.com/apache/crunch/pull/20#discussion_r263484180
 
 

 ##
 File path: 
crunch-core/src/main/java/org/apache/crunch/util/CrunchRenameCopyListing.java
 ##
 @@ -0,0 +1,269 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information regarding copyright 
ownership.  The ASF licenses this file to you under the
+ * Apache License, Version 2.0 (the "License"); you may not use this file 
except in compliance with the License.  You may obtain a
+ * copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License is distributed on an "AS IS"
+ * BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+package org.apache.crunch.util;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.security.Credentials;
+import org.apache.hadoop.tools.CopyListing;
+import org.apache.hadoop.tools.CopyListingFileStatus;
+import org.apache.hadoop.tools.DistCpOptions;
+import org.apache.hadoop.tools.DistCpOptions.FileAttribute;
+import org.apache.hadoop.tools.SimpleCopyListing;
+import org.apache.hadoop.tools.util.DistCpUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Stack;
+
+/**
+ * A custom {@link CopyListing} implementation capable of dynamically renaming
+ * the target paths according to a configured set of values.
+ * 
+ * Once https://issues.apache.org/jira/browse/HADOOP-16147 is available, this
+ * class can be significantly simplified.
+ * 
+ */
+public class CrunchRenameCopyListing extends SimpleCopyListing {
+  /**
+   * Comma-separated list of original-file:renamed-file path rename pairs.
+   */
+  public static final String DISTCP_PATH_RENAMES = 
"crunch.distcp.path.renames";
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(CrunchRenameCopyListing.class);
+  private final Map<String, String> pathRenames;
+
+  private long totalPaths = 0;
+  private long totalBytesToCopy = 0;
+
+  /**
+   * Protected constructor, to initialize configuration.
+   *
+   * @param configuration The input configuration, with which the 
source/target FileSystems may be accessed.
+   * @param credentials - Credentials object on which the FS delegation tokens 
are cached. If null
+   * delegation token caching is skipped
+   */
+  protected CrunchRenameCopyListing(Configuration configuration, Credentials 
credentials) {
 
 Review comment:
   fixed: 
https://github.com/apache/crunch/pull/20/commits/8794d790b4b63444d61ccb013df8b4dffb1dc7d1
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 209674)
Time Spent: 1h  (was: 50m)

> Improvements for usage of DistCp
> 
>
> Key: CRUNCH-679
> URL: https://issues.apache.org/jira/browse/CRUNCH-679
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and 
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-0. Currently the 
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile 
> method, and would therefore create destination file names like out0-m-0, 
> which are problematic when there are multiple map-only jobs writing to the 
> same target path. This can be achieved by providing a custom CopyListing 
> implementation that is capable of dynamically renaming target paths based on 
> a given mapping. Unfortunately a substantial amount of code duplication from 
> the original SimpleCopyListing class is currently required in order to inject 
> the necessary logic for modifying the sequence file entry keys. HADOOP-16147 
> has been opened to allow it to be simplified in the future.
> * The handleOutputs implementation in HFileTarget is essentially identical to 
> the one in FileTargetImpl that it overrides. We can remove it and just share 
> the same code.
> * It could be useful to add a property for configuring the max DistCp task 
> bandwidth, as the default (100 MB/s per task) may be too high for certain 
> environments.
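
A small usage sketch of that rename mapping (the file names are illustrative; the key and original-file:renamed-file pair format come from CrunchRenameCopyListing, quoted in the review thread above):

{code:java}
Configuration conf = new Configuration();
// Comma-separated original-file:renamed-file pairs consumed by the
// custom CopyListing when it builds the DistCp sequence file.
conf.setStrings(CrunchRenameCopyListing.DISTCP_PATH_RENAMES,
    "out0-m-00000:part-m-00000",
    "out1-m-00000:part-m-00001");
{code}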

[jira] [Work logged] (CRUNCH-679) Improvements for usage of DistCp

2019-03-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-679?focusedWorklogId=208986=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-208986
 ]

ASF GitHub Bot logged work on CRUNCH-679:
-

Author: ASF GitHub Bot
Created on: 06/Mar/19 17:13
Start Date: 06/Mar/19 17:13
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #20: CRUNCH-679: 
Improvements for usage of DistCp
URL: https://github.com/apache/crunch/pull/20#discussion_r263044336
 
 

 ##
 File path: 
crunch-core/src/main/java/org/apache/crunch/util/CrunchRenameCopyListing.java
 ##
 @@ -0,0 +1,269 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information regarding copyright 
ownership.  The ASF licenses this file to you under the
+ * Apache License, Version 2.0 (the "License"); you may not use this file 
except in compliance with the License.  You may obtain a
+ * copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License is distributed on an "AS IS"
+ * BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+package org.apache.crunch.util;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.security.Credentials;
+import org.apache.hadoop.tools.CopyListing;
+import org.apache.hadoop.tools.CopyListingFileStatus;
+import org.apache.hadoop.tools.DistCpOptions;
+import org.apache.hadoop.tools.DistCpOptions.FileAttribute;
+import org.apache.hadoop.tools.SimpleCopyListing;
+import org.apache.hadoop.tools.util.DistCpUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Stack;
+
+/**
+ * A custom {@link CopyListing} implementation capable of dynamically renaming
+ * the target paths according to a configured set of values.
+ * 
+ * Once https://issues.apache.org/jira/browse/HADOOP-16147 is available, this
+ * class can be significantly simplified.
+ * 
+ */
+public class CrunchRenameCopyListing extends SimpleCopyListing {
+  /**
+   * Comma-separated list of original-file:renamed-file path rename pairs.
+   */
+  public static final String DISTCP_PATH_RENAMES = 
"crunch.distcp.path.renames";
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(CrunchRenameCopyListing.class);
+  private final Map<String, String> pathRenames;
+
+  private long totalPaths = 0;
+  private long totalBytesToCopy = 0;
+
+  /**
+   * Protected constructor, to initialize configuration.
+   *
+   * @param configuration The input configuration, with which the 
source/target FileSystems may be accessed.
+   * @param credentials - Credentials object on which the FS delegation tokens 
are cached. If null
+   * delegation token caching is skipped
+   */
+  protected CrunchRenameCopyListing(Configuration configuration, Credentials 
credentials) {
 
 Review comment:
   This constructor needs to be public, otherwise it can fail with the 
following exception (if HADOOP_CLASSPATH isn't set)
   
   ```
   Caused by: java.io.IOException: Unable to instantiate 
org.apache.hadoop.tools.CrunchRenameCopyListing
at 
org.apache.hadoop.tools.CopyListing.getCopyListing(CopyListing.java:284)
at 
org.apache.hadoop.tools.CrunchDistCp.createInputFileListing(CrunchDistCp.java:430)
at 
org.apache.hadoop.tools.CrunchDistCp.prepareFileListing(CrunchDistCp.java:94)
at org.apache.hadoop.tools.CrunchDistCp.execute(CrunchDistCp.java:184)
at 
org.apache.crunch.io.impl.FileTargetImpl.handleOutputsDistributedCopy(FileTargetImpl.java:278)
... 11 more
   Caused by: java.lang.IllegalAccessException: Class 
org.apache.hadoop.tools.CopyListing can not access a member of class 
org.apache.hadoop.tools.CrunchRenameCopyListing with modifiers "protected"
at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:102)
at 
java.lang.reflect.AccessibleObject.slowCheckMemberAccess(AccessibleObject.java:296)
at 
java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:288)
at java.lang.reflect.Constructor.newInstance(Constructor.java:413)
at 
org.apache.hadoop.tools.CopyListing.getCopyListing(CopyListing.java:282)
... 15 more
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Work logged] (CRUNCH-679) Improvements for usage of DistCp

2019-03-05 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-679?focusedWorklogId=207918=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-207918
 ]

ASF GitHub Bot logged work on CRUNCH-679:
-

Author: ASF GitHub Bot
Created on: 05/Mar/19 18:03
Start Date: 05/Mar/19 18:03
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #20: CRUNCH-679: 
Improvements for usage of DistCp
URL: https://github.com/apache/crunch/pull/20#discussion_r262615465
 
 

 ##
 File path: 
crunch-core/src/main/java/org/apache/crunch/util/CrunchRenameCopyListing.java
 ##
 @@ -0,0 +1,269 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information regarding copyright 
ownership.  The ASF licenses this file to you under the
+ * Apache License, Version 2.0 (the "License"); you may not use this file 
except in compliance with the License.  You may obtain a
+ * copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License is distributed on an "AS IS"
+ * BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+package org.apache.crunch.util;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.security.Credentials;
+import org.apache.hadoop.tools.CopyListing;
+import org.apache.hadoop.tools.CopyListingFileStatus;
+import org.apache.hadoop.tools.DistCpOptions;
+import org.apache.hadoop.tools.DistCpOptions.FileAttribute;
+import org.apache.hadoop.tools.SimpleCopyListing;
+import org.apache.hadoop.tools.util.DistCpUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Stack;
+
+/**
+ * A custom {@link CopyListing} implementation capable of dynamically renaming
+ * the target paths according to a configured set of values.
+ * 
+ * Once https://issues.apache.org/jira/browse/HADOOP-16147 is available, this
+ * class can be significantly simplified.
+ * 
+ */
+public class CrunchRenameCopyListing extends SimpleCopyListing {
+  /**
+   * Comma-separated list of original-file:renamed-file path rename pairs.
+   */
+  public static final String DISTCP_PATH_RENAMES = 
"crunch.distcp.path.renames";
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(CrunchRenameCopyListing.class);
+  private final Map<String, String> pathRenames;
+
+  private long totalPaths = 0;
+  private long totalBytesToCopy = 0;
+
+  /**
+   * Protected constructor, to initialize configuration.
+   *
+   * @param configuration The input configuration, with which the 
source/target FileSystems may be accessed.
+   * @param credentials - Credentials object on which the FS delegation tokens 
are cached. If null
+   * delegation token caching is skipped
+   */
+  protected CrunchRenameCopyListing(Configuration configuration, Credentials 
credentials) {
+super(configuration, credentials);
+
+pathRenames = new HashMap<>();
+
+String[] pathRenameConf = configuration.getStrings(DISTCP_PATH_RENAMES);
 
 Review comment:
   I ran some tests and did not encounter any problems I would consider 
showstopping. My largest concern is that the DistCp job completion time 
appeared to be delayed roughly in proportion to the size of the configuration, 
with a delay of nearly two minutes observed in the worst-case test - I think 
probably due to persisting the configuration into job history. 
   
   ```
   19/03/05 11:04:53 INFO mapreduce.Job:  map 100% reduce 0%
   19/03/05 11:06:48 INFO mapreduce.Job: Job job_1538164922410_717171 completed 
successfully
   ```
   
   Otherwise everything looked good, and viewing the job configuration through 
the web UI worked fine, although it was slow when the size ran to multiple MBs.
   
   The configuration property value size was retrieved by:
   ```
   conf.get(CrunchRenameCopyListing.DISTCP_PATH_RENAMES).length()
   ```
   
   Number of part files | Configuration size | Completion delay (m:ss)
   -- | -- | --
   10k | 260 KB | 0:05
   50k | 1.3 MB | 0:29
   250k | 6.8 MB | 0:54
   500k | 13.8 MB | 1:55
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Work logged] (CRUNCH-679) Improvements for usage of DistCp

2019-03-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-679?focusedWorklogId=206528=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-206528
 ]

ASF GitHub Bot logged work on CRUNCH-679:
-

Author: ASF GitHub Bot
Created on: 01/Mar/19 17:09
Start Date: 01/Mar/19 17:09
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #20: CRUNCH-679: 
Improvements for usage of DistCp
URL: https://github.com/apache/crunch/pull/20#discussion_r261682765
 
 

 ##
 File path: 
crunch-core/src/main/java/org/apache/crunch/util/CrunchRenameCopyListing.java
 ##
 @@ -0,0 +1,261 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information regarding copyright 
ownership.  The ASF licenses this file to you under the
+ * Apache License, Version 2.0 (the "License"); you may not use this file 
except in compliance with the License.  You may obtain a
+ * copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License is distributed on an "AS IS"
+ * BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+package org.apache.crunch.util;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.security.Credentials;
+import org.apache.hadoop.tools.CopyListing;
+import org.apache.hadoop.tools.CopyListingFileStatus;
+import org.apache.hadoop.tools.DistCpOptions;
+import org.apache.hadoop.tools.DistCpOptions.FileAttribute;
+import org.apache.hadoop.tools.SimpleCopyListing;
+import org.apache.hadoop.tools.util.DistCpUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Stack;
+
+/**
+ * A custom {@link CopyListing} implementation capable of dynamically renaming
+ * the target paths according to a configured set of values.
+ * 
+ * Once https://issues.apache.org/jira/browse/HADOOP-16147 is available, this
+ * class can be significantly simplified.
+ * 
+ */
+public class CrunchRenameCopyListing extends SimpleCopyListing {
+  /**
+   * Comma-separated list of original-file:renamed-file path rename pairs.
+   */
+  public static final String DISTCP_PATH_RENAMES = 
"crunch.distcp.path.renames";
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(CrunchRenameCopyListing.class);
+  private final Map<String, String> pathRenames;
+
+  private long totalPaths = 0;
+  private long totalBytesToCopy = 0;
+
+  /**
+   * Protected constructor, to initialize configuration.
+   *
+   * @param configuration The input configuration, with which the 
source/target FileSystems may be accessed.
+   * @param credentials - Credentials object on which the FS delegation tokens 
are cached. If null
+   * delegation token caching is skipped
+   */
+  protected CrunchRenameCopyListing(Configuration configuration, Credentials 
credentials) {
+super(configuration, credentials);
+
+pathRenames = new HashMap<>();
+
+String[] pathRenameConf = configuration.getStrings(DISTCP_PATH_RENAMES);
+if (pathRenameConf == null) {
+  throw new IllegalArgumentException("Missing required configuration: " + 
DISTCP_PATH_RENAMES);
+}
+for (String pathRename : pathRenameConf) {
+  String[] pathRenameParts = pathRename.split(":");
+  if (pathRenameParts.length != 2) {
+throw new IllegalArgumentException("Invalid path rename format: " + 
pathRename);
+  }
+  if (pathRenames.put(pathRenameParts[0], pathRenameParts[1]) != null) {
+throw new IllegalArgumentException("Invalid duplicate path rename: " + 
pathRenameParts[0]);
+  }
+}
+LOG.info("Loaded {} path rename entries", pathRenames.size());
+  }
+
+  @Override
+  public void doBuildListing(SequenceFile.Writer fileListWriter, DistCpOptions 
options) throws IOException {
+try {
+  for (Path path : options.getSourcePaths()) {
+FileSystem sourceFS = path.getFileSystem(getConf());
+final boolean preserveAcls = options.shouldPreserve(FileAttribute.ACL);
+final boolean preserveXAttrs = 
options.shouldPreserve(FileAttribute.XATTR);
+final boolean preserveRawXAttrs = options.shouldPreserveRawXattrs();
+path = makeQualified(path);
+
+FileStatus rootStatus = sourceFS.getFileStatus(path);
+Path sourcePathRoot = computeSourceRootPath(rootStatus, options);
+
+FileStatus[] 

[jira] [Work logged] (CRUNCH-680) Kafka Source should split very large partitions

2019-03-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-680?focusedWorklogId=206520=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-206520
 ]

ASF GitHub Bot logged work on CRUNCH-680:
-

Author: ASF GitHub Bot
Created on: 01/Mar/19 16:40
Start Date: 01/Mar/19 16:40
Worklog Time Spent: 10m 
  Work Description: mkwhitacre commented on pull request #21: CRUNCH-680: 
Kafka Source should split very large partitions
URL: https://github.com/apache/crunch/pull/21
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 206520)
Time Spent: 20m  (was: 10m)

> Kafka Source should split very large partitions
> ---
>
> Key: CRUNCH-680
> URL: https://issues.apache.org/jira/browse/CRUNCH-680
> Project: Crunch
>  Issue Type: Improvement
>  Components: IO
>Reporter: Andrew Olson
>Assignee: Micah Whitacre
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a single Kafka partition has a very large number of messages, the map task 
> for that partition can take a long time to run, leading to potential timeout 
> problems. We should limit the number of messages assigned to each split so 
> that the workload is more evenly balanced.
> Based on our testing we have determined that 5 million messages should be a 
> generally reasonable default for the maximum split size, with a configuration 
> property (org.apache.crunch.kafka.split.max) provided to optionally override 
> that value.
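
A minimal sketch of the chunking idea described above (illustrative only; the real logic lives in the crunch-kafka split calculation):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Chunk a partition's offset range [start, end) into splits of at most
// maxSplitSize messages (default 5,000,000, overridable via
// org.apache.crunch.kafka.split.max).
static List<long[]> toSplitRanges(long start, long end, long maxSplitSize) {
  List<long[]> splits = new ArrayList<>();
  for (long s = start; s < end; s += maxSplitSize) {
    splits.add(new long[] { s, Math.min(s + maxSplitSize, end) });
  }
  return splits;
}
{code}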



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-680) Kafka Source should split very large partitions

2019-02-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-680?focusedWorklogId=204840=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-204840
 ]

ASF GitHub Bot logged work on CRUNCH-680:
-

Author: ASF GitHub Bot
Created on: 26/Feb/19 23:03
Start Date: 26/Feb/19 23:03
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #21: CRUNCH-680: 
Kafka Source should split very large partitions
URL: https://github.com/apache/crunch/pull/21
 
 
   Introduces relatively straightforward logic to chunk very large partitions 
into multiple splits. Some missing unit tests were also added.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 204840)
Time Spent: 10m
Remaining Estimate: 0h

> Kafka Source should split very large partitions
> ---
>
> Key: CRUNCH-680
> URL: https://issues.apache.org/jira/browse/CRUNCH-680
> Project: Crunch
>  Issue Type: Improvement
>  Components: IO
>Reporter: Andrew Olson
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a single Kafka partition has a very large number of messages, the map task 
> for that partition can take a long time to run, leading to potential timeout 
> problems. We should limit the number of messages assigned to each split so 
> that the workload is more evenly balanced.
> Based on our testing we have determined that 5 million messages should be a 
> generally reasonable default for the maximum split size, with a configuration 
> property (org.apache.crunch.kafka.split.max) provided to optionally override 
> that value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-679) Improvements for usage of DistCp

2019-02-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-679?focusedWorklogId=204700=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-204700
 ]

ASF GitHub Bot logged work on CRUNCH-679:
-

Author: ASF GitHub Bot
Created on: 26/Feb/19 19:33
Start Date: 26/Feb/19 19:33
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #20: CRUNCH-679: 
Improvements for usage of DistCp
URL: https://github.com/apache/crunch/pull/20
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 204700)
Time Spent: 10m
Remaining Estimate: 0h

> Improvements for usage of DistCp
> 
>
> Key: CRUNCH-679
> URL: https://issues.apache.org/jira/browse/CRUNCH-679
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and 
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-0. Currently the 
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile 
> method, and would therefore create destination file names like out0-m-0, 
> which are problematic when there are multiple map-only jobs writing to the 
> same target path. This can be achieved by providing a custom CopyListing 
> implementation that is capable of dynamically renaming target paths based on 
> a given mapping. Unfortunately a substantial amount of code duplication from 
> the original SimpleCopyListing class is currently required in order to inject 
> the necessary logic for modifying the sequence file entry keys. HADOOP-16147 
> has been opened to allow it to be simplified in the future.
> * The handleOutputs implementation in HFileTarget is essentially identical to 
> the one in FileTargetImpl that it overrides. We can remove it and just share 
> the same code.
> * It could be useful to add a property for configuring the max DistCp task 
> bandwidth, as the default (100 MB/s per task) may be too high for certain 
> environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-677) Support passing FileSystem to File Sources and Targets

2019-02-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-677?focusedWorklogId=202097=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-202097
 ]

ASF GitHub Bot logged work on CRUNCH-677:
-

Author: ASF GitHub Bot
Created on: 21/Feb/19 17:18
Start Date: 21/Feb/19 17:18
Worklog Time Spent: 10m 
  Work Description: ben-roling commented on issue #19: CRUNCH-677 Source 
and Target accept FileSystem
URL: https://github.com/apache/crunch/pull/19#issuecomment-466086262
 
 
   Merge mistakes are now fixed and the build is passing.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 202097)
Time Spent: 40m  (was: 0.5h)

> Support passing FileSystem to File Sources and Targets
> --
>
> Key: CRUNCH-677
> URL: https://issues.apache.org/jira/browse/CRUNCH-677
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Ben Roling
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully 
> qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily 
> have enough information to understand the Path.  In particular, when the Path 
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have 
> the 
> [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
>  to resolve "hdfs://my-cluster".
> While it is in some cases possible to seed the Pipeline configuration with 
> all the HDFS properties necessary to communicate with any HDFS HA cluster the 
> Pipeline might talk to, it can be awkward and/or difficult to do this in all 
> cases.  We have cases where we'd like not to have to know all of the clusters 
> upfront.
> With the proposed change, code like the following is possible, where 
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems 
> synthesized from Configuration completely separate from that used to 
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data = 
> pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data, To.textFile("hdfs://my-cluster-2/output", 
> writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths 
> would not strictly need to be included as they would be implied by the 
> FileSystem instances passed in the calls.  As such the paths could simply be 
> passed as "/data" and "/output" with equivalent behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-677) Support passing FileSystem to File Sources and Targets

2019-02-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-677?focusedWorklogId=202059=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-202059
 ]

ASF GitHub Bot logged work on CRUNCH-677:
-

Author: ASF GitHub Bot
Created on: 21/Feb/19 16:37
Start Date: 21/Feb/19 16:37
Worklog Time Spent: 10m 
  Work Description: ben-roling commented on pull request #19: CRUNCH-677 
Source and Target accept FileSystem
URL: https://github.com/apache/crunch/pull/19#discussion_r259011894
 
 

 ##
 File path: 
crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
 ##
 @@ -178,7 +206,7 @@ public void handleOutputs(Configuration conf, Path 
workingPath, int index) throw
   if (useDistributedCopy) {
 LOG.info("Source and destination are in different file systems, 
performing distributed copy from {} to {}", srcPattern,
 path);
-handeOutputsDistributedCopy(conf, srcPattern, srcFs, dstFs, 
maxDistributedCopyTasks);
+handleOutputsDistributedCopy(dstFsConf, srcPattern, srcFs, dstFs, 
maxDistributedCopyTasks);
 
 Review comment:
   This is a cherry-pick merge mistake causing the build to fail.  I'll fix in 
a second and make sure the build and all tests are passing.   Btw, this was 
originally developed on an internal fork and reviewed with my colleagues, 
@noslowerdna and @mkwhitacre.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 202059)
Time Spent: 0.5h  (was: 20m)

> Support passing FileSystem to File Sources and Targets
> --
>
> Key: CRUNCH-677
> URL: https://issues.apache.org/jira/browse/CRUNCH-677
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Ben Roling
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully 
> qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily 
> have enough information to understand the Path.  In particular, when the Path 
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have 
> the 
> [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
>  to resolve "hdfs://my-cluster".
> While it is sometimes possible to seed the Pipeline configuration with all 
> the HDFS properties necessary to communicate with every HDFS HA cluster the 
> Pipeline might talk to, doing so can be awkward or difficult. We have cases 
> where we'd prefer not to have to know all of the clusters upfront.
> With the proposed change, code like the following is possible, where 
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems 
> synthesized from Configuration completely separate from that used to 
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data = 
> pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data, To.textFile("hdfs://my-cluster-2/output", 
> writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths 
> would not strictly need to be included as they would be implied by the 
> FileSystem instances passed in the calls.  As such the paths could simply be 
> passed as "/data" and "/output" with equivalent behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-677) Support passing FileSystem to File Sources and Targets

2019-02-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-677?focusedWorklogId=202056&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-202056
 ]

ASF GitHub Bot logged work on CRUNCH-677:
-

Author: ASF GitHub Bot
Created on: 21/Feb/19 16:36
Start Date: 21/Feb/19 16:36
Worklog Time Spent: 10m 
  Work Description: ben-roling commented on pull request #19: CRUNCH-677 
Source and Target accept FileSystem
URL: https://github.com/apache/crunch/pull/19#discussion_r259011894
 
 

 ##
 File path: 
crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
 ##
 @@ -178,7 +206,7 @@ public void handleOutputs(Configuration conf, Path 
workingPath, int index) throw
   if (useDistributedCopy) {
 LOG.info("Source and destination are in different file systems, performing distributed copy from {} to {}", srcPattern, path);
-handeOutputsDistributedCopy(conf, srcPattern, srcFs, dstFs, maxDistributedCopyTasks);
+handleOutputsDistributedCopy(dstFsConf, srcPattern, srcFs, dstFs, maxDistributedCopyTasks);
 
 Review comment:
   This is a cherry-pick merge mistake causing the build to fail. I'll fix it 
in a second and make sure the build and all tests are passing. This was 
originally developed on an internal fork and reviewed with my colleagues, 
@noslowerdna and @mkwhitacre.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 202056)
Time Spent: 20m  (was: 10m)

> Support passing FileSystem to File Sources and Targets
> --
>
> Key: CRUNCH-677
> URL: https://issues.apache.org/jira/browse/CRUNCH-677
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Ben Roling
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully 
> qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily 
> have enough information to understand the Path.  In particular, when the Path 
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have 
> the 
> [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
>  to resolve "hdfs://my-cluster".
> While it is sometimes possible to seed the Pipeline configuration with all 
> the HDFS properties necessary to communicate with every HDFS HA cluster the 
> Pipeline might talk to, doing so can be awkward or difficult. We have cases 
> where we'd prefer not to have to know all of the clusters upfront.
> With the proposed change, code like the following is possible, where 
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems 
> synthesized from Configuration completely separate from that used to 
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data = 
> pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data, To.textFile("hdfs://my-cluster-2/output", 
> writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths 
> would not strictly need to be included as they would be implied by the 
> FileSystem instances passed in the calls.  As such the paths could simply be 
> passed as "/data" and "/output" with equivalent behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-677) Support passing FileSystem to File Sources and Targets

2019-02-20 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-677?focusedWorklogId=201490&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-201490
 ]

ASF GitHub Bot logged work on CRUNCH-677:
-

Author: ASF GitHub Bot
Created on: 20/Feb/19 18:05
Start Date: 20/Feb/19 18:05
Worklog Time Spent: 10m 
  Work Description: ben-roling commented on pull request #19: CRUNCH-677 
Source and Target accept FileSystem
URL: https://github.com/apache/crunch/pull/19
 
 
   The change to Source, Target, and SourceTarget obviously breaks 
compatibility for implementors of these interfaces, as we're still building 
against Java 7 and so can't provide default implementations for these new 
methods.
   
   Also, there is an expectation that Target implementations will need to 
update the asSourceTarget() method to copy the FileSystem along, as sketched 
below.
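   
   For illustration, a hypothetical Target implementation might propagate its 
FileSystem like this (the class, field, and fileSystem() method names here 
are assumptions based on this description, not the actual patch):
   
{code}
// Hypothetical sketch: copy the configured FileSystem along when the
// Target is converted into a SourceTarget (names are illustrative).
@Override
public <T> SourceTarget<T> asSourceTarget(PType<T> ptype) {
  SourceTarget<T> sourceTarget = new MyFileSourceTarget<T>(path, ptype);
  if (fileSystem != null) {
    sourceTarget = sourceTarget.fileSystem(fileSystem);
  }
  return sourceTarget;
}
{code}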
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 201490)
Time Spent: 10m
Remaining Estimate: 0h

> Support passing FileSystem to File Sources and Targets
> --
>
> Key: CRUNCH-677
> URL: https://issues.apache.org/jira/browse/CRUNCH-677
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Ben Roling
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully 
> qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily 
> have enough information to understand the Path.  In particular, when the Path 
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have 
> the 
> [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
>  to resolve "hdfs://my-cluster".
> While it is sometimes possible to seed the Pipeline configuration with all 
> the HDFS properties necessary to communicate with every HDFS HA cluster the 
> Pipeline might talk to, doing so can be awkward or difficult. We have cases 
> where we'd prefer not to have to know all of the clusters upfront.
> With the proposed change, code like the following is possible, where 
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems 
> synthesized from Configuration completely separate from that used to 
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data = 
> pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data, To.textFile("hdfs://my-cluster-2/output", 
> writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths 
> would not strictly need to be included as they would be implied by the 
> FileSystem instances passed in the calls.  As such the paths could simply be 
> passed as "/data" and "/output" with equivalent behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-678) Avoid unnecessary retrieval of last modified time

2019-02-20 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-678?focusedWorklogId=201376&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-201376
 ]

ASF GitHub Bot logged work on CRUNCH-678:
-

Author: ASF GitHub Bot
Created on: 20/Feb/19 15:33
Start Date: 20/Feb/19 15:33
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on issue #18: CRUNCH-678: Avoid 
unnecessary last modified time retrieval
URL: https://github.com/apache/crunch/pull/18#issuecomment-465627513
 
 
   Merged here: 
https://github.com/apache/crunch/commit/571b90c03e3010e7bb9badf4e6e441ab2164be56
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 201376)
Time Spent: 0.5h  (was: 20m)

> Avoid unnecessary retrieval of last modified time
> -
>
> Key: CRUNCH-678
> URL: https://issues.apache.org/jira/browse/CRUNCH-678
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is no assurance that the last modified time can be retrieved 
> efficiently for all file systems. In particular, with object stores and large 
> data sets it could be very slow. Since this information is actually not 
> always needed, we should only retrieve it when necessary (i.e. when the write 
> mode is checkpoint) for sources and targets.
> CRUNCH-658 expressed similar concerns for the getSize method. This would be a 
> simpler and safer optimization to make.
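> As a rough illustration of the intended laziness (a sketch using plain 
> Hadoop APIs and Crunch's Target.WriteMode, not the actual patch):
> {code}
> // Sketch: fetch the last modified time only when the checkpoint
> // write mode actually needs it.
> long getLastModifiedIfNeeded(FileSystem fs, Path path, Target.WriteMode mode)
>     throws IOException {
>   if (mode != Target.WriteMode.CHECKPOINT) {
>     return -1; // callers never use it; skip the potentially slow call
>   }
>   // Against object stores this getFileStatus call can be very slow.
>   return fs.getFileStatus(path).getModificationTime();
> }
> {code}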



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-678) Avoid unnecessary retrieval of last modified time

2019-02-20 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-678?focusedWorklogId=201377&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-201377
 ]

ASF GitHub Bot logged work on CRUNCH-678:
-

Author: ASF GitHub Bot
Created on: 20/Feb/19 15:33
Start Date: 20/Feb/19 15:33
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #18: CRUNCH-678: 
Avoid unnecessary last modified time retrieval
URL: https://github.com/apache/crunch/pull/18
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 201377)
Time Spent: 40m  (was: 0.5h)

> Avoid unnecessary retrieval of last modified time
> -
>
> Key: CRUNCH-678
> URL: https://issues.apache.org/jira/browse/CRUNCH-678
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There is no assurance that the last modified time can be retrieved 
> efficiently for all file systems. In particular, with object stores and large 
> data sets it could be very slow. Since this information is actually not 
> always needed, we should only retrieve it when necessary (i.e. when the write 
> mode is checkpoint) for sources and targets.
> CRUNCH-658 expressed similar concerns for the getSize method. This would be a 
> simpler and safer optimization to make.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-678) Avoid unnecessary retrieval of last modified time

2019-02-20 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-678?focusedWorklogId=201378&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-201378
 ]

ASF GitHub Bot logged work on CRUNCH-678:
-

Author: ASF GitHub Bot
Created on: 20/Feb/19 15:34
Start Date: 20/Feb/19 15:34
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on issue #18: CRUNCH-678: Avoid 
unnecessary last modified time retrieval
URL: https://github.com/apache/crunch/pull/18#issuecomment-465627513
 
 
   Committed here: 
https://github.com/apache/crunch/commit/571b90c03e3010e7bb9badf4e6e441ab2164be56
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 201378)
Time Spent: 50m  (was: 40m)

> Avoid unnecessary retrieval of last modified time
> -
>
> Key: CRUNCH-678
> URL: https://issues.apache.org/jira/browse/CRUNCH-678
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> There is no assurance that the last modified time can be retrieved 
> efficiently for all file systems. In particular, with object stores and large 
> data sets it could be very slow. Since this information is actually not 
> always needed, we should only retrieve it when necessary (i.e. when the write 
> mode is checkpoint) for sources and targets.
> CRUNCH-658 expressed similar concerns for the getSize method. This would be a 
> simpler and safer optimization to make.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-678) Avoid unnecessary retrieval of last modified time

2019-02-19 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-678?focusedWorklogId=200991&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-200991
 ]

ASF GitHub Bot logged work on CRUNCH-678:
-

Author: ASF GitHub Bot
Created on: 20/Feb/19 00:26
Start Date: 20/Feb/19 00:26
Worklog Time Spent: 10m 
  Work Description: jwills commented on issue #18: CRUNCH-678: Avoid 
unnecessary last modified time retrieval
URL: https://github.com/apache/crunch/pull/18#issuecomment-465367798
 
 
   And merged-- thank you Andrew!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 200991)
Time Spent: 20m  (was: 10m)

> Avoid unnecessary retrieval of last modified time
> -
>
> Key: CRUNCH-678
> URL: https://issues.apache.org/jira/browse/CRUNCH-678
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There is no assurance that the last modified time can be retrieved 
> efficiently for all file systems. In particular, with object stores and large 
> data sets it could be very slow. Since this information is actually not 
> always needed, we should only retrieve it when necessary (i.e. when the write 
> mode is checkpoint) for sources and targets.
> CRUNCH-658 expressed similar concerns for the getSize method. This would be a 
> simpler and safer optimization to make.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-678) Avoid unnecessary retrieval of last modified time

2019-02-19 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-678?focusedWorklogId=200971&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-200971
 ]

ASF GitHub Bot logged work on CRUNCH-678:
-

Author: ASF GitHub Bot
Created on: 19/Feb/19 23:28
Start Date: 19/Feb/19 23:28
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #18: CRUNCH-678: 
Avoid unnecessary last modified time retrieval
URL: https://github.com/apache/crunch/pull/18
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 200971)
Time Spent: 10m
Remaining Estimate: 0h

> Avoid unnecessary retrieval of last modified time
> -
>
> Key: CRUNCH-678
> URL: https://issues.apache.org/jira/browse/CRUNCH-678
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Andrew Olson
>Assignee: Josh Wills
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is no assurance that the last modified time can be retrieved 
> efficiently for all file systems. In particular, with object stores and large 
> data sets it could be very slow. Since this information is actually not 
> always needed, we should only retrieve it when necessary (i.e. when the write 
> mode is checkpoint) for sources and targets.
> CRUNCH-658 expressed similar concerns for the getSize method. This would be a 
> simpler and safer optimization to make.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (CRUNCH-660) FileTargetImpl uses Distcp vs FileUtils.copy

2019-02-19 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/CRUNCH-660?focusedWorklogId=200837&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-200837
 ]

ASF GitHub Bot logged work on CRUNCH-660:
-

Author: ASF GitHub Bot
Created on: 19/Feb/19 19:42
Start Date: 19/Feb/19 19:42
Worklog Time Spent: 10m 
  Work Description: noslowerdna commented on pull request #17: CRUNCH-660, 
CRUNCH-675: Use DistCp instead of FileUtils.copy when sou…
URL: https://github.com/apache/crunch/pull/17
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 200837)
Time Spent: 20m  (was: 10m)

> FileTargetImpl uses Distcp vs FileUtils.copy
> 
>
> Key: CRUNCH-660
> URL: https://issues.apache.org/jira/browse/CRUNCH-660
> Project: Crunch
>  Issue Type: Improvement
>  Components: Core
>Reporter: Micah Whitacre
>Assignee: Josh Wills
>Priority: Major
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For handling multiple runtimes I'm not sure there is a way to solve this, 
> but I'm documenting it as a JIRA regardless.
> If you are running in a multi-cluster environment where you might want to 
> read data from one cluster and then write the output on another cluster 
> (e.g. generating HFiles to be loaded into a separate HBase cluster), the 
> cost of moving files is noticeable, specifically because the files appear 
> to be moved in the launcher/driver process rather than as part of the node 
> execution.[1]
> An efficient option would be to kick off a DistCp instead (see the sketch 
> below), but that would tie the target directly to a runtime, which is not 
> a great approach.
> [1] - 
> https://github.com/apache/crunch/blob/5609b014378d3460a55ce25522f0c00659872807/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java#L157
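> A hedged sketch of what kicking off a programmatic DistCp could look like 
> (this assumes the Hadoop 2-era DistCpOptions constructor; Hadoop 3 moved to 
> a builder, and the paths and map limit below are illustrative):
> {code}
> // Sketch only: run the cross-cluster copy as a MapReduce DistCp job
> // instead of copying files in the launcher/driver process.
> DistCpOptions options =
>     new DistCpOptions(Collections.singletonList(srcPath), dstPath);
> options.setMaxMaps(20); // illustrative cap on copy parallelism
> new DistCp(conf, options).execute(); // blocks until the DistCp job completes
> {code}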



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)