Re: preferredlocations for hadoopfsrelations based baseRelations

2020-06-29 Thread Steve Loughran
Here's a class which lets you provide a function, applied row by row, to
declare each row's location:

https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala

It needs to live under o.a.spark because something it depends on is scoped
to the Spark packages only.

I used it for a PoC of a distcp replacement: each row was a filename, so
the location of each row was the server holding the first block of the file:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala#L137

It would be convenient if either the bits of the API I needed were public or
the extra RDD code just went in somewhere. It's nothing complicated.
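
For small, driver-side collections, the per-row locality idea above can be approximated with public API. A minimal sketch, assuming Spark is on the classpath: `SparkContext.makeRDD` has an overload taking `Seq[(T, Seq[String])]` that creates one partition per element and records the given hosts as that partition's preferred locations. This is not the linked ParallelizedWithLocalityRDD class; the `locate` function, the host names and the paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PerRowLocalitySketch {
  // Hypothetical lookup: maps a file name to the hosts that should process it.
  // A real job might ask the filesystem for the file's block locations instead.
  def locate(fileName: String): Seq[String] =
    if (fileName.startsWith("/logs/")) Seq("host-a.example.org")
    else Seq("host-b.example.org")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("per-row-locality").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val files = Seq("/logs/part-0000", "/data/part-0001")
      // One partition per element; the second tuple member becomes the
      // preferred locations of that element's partition.
      val rdd = sc.makeRDD[String](files.map(f => (f, locate(f))))
      rdd.partitions.foreach { p =>
        println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(",")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The locations are computed eagerly on the driver here, so this only suits lists that fit comfortably in driver memory.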

On Thu, 4 Jun 2020 at 09:31, ZHANG Wei wrote:

> AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
> method, whose result is ordered by data size, to get a partition's
> preferred locations. If there are other criteria to sort by, I wonder
> whether [1] would be the place to add them. Alternatively, subclassing
> `FilePartition` and overriding `preferredLocations()` might also work.
>
> --
> Cheers,
> -z
> [1]
> https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43
>
> On Thu, 4 Jun 2020 06:40:43 +
> Nasrulla Khan Haris wrote:
>
> > Hi Spark developers,
> >
> > I have created a new format extending FileFormat. I see that
> > getPreferredLocations is available if a new custom RDD is created. Since
> > FileFormat is based on FileScanRDD, which uses the readfile method to
> > read a partitioned file, is there a way to add the desired
> > preferredLocations?
> >
> > Appreciate your responses.
> >
> > Thanks,
> > NKH
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: preferredlocations for hadoopfsrelations based baseRelations

2020-06-04 Thread ZHANG Wei
AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
method, whose result is ordered by data size, to get a partition's
preferred locations. If there are other criteria to sort by, I wonder
whether [1] would be the place to add them. Alternatively, subclassing
`FilePartition` and overriding `preferredLocations()` might also work.
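
The ordering by data size described above can be modelled without Spark. A minimal sketch, not the code at [1]: `FileSlice` is a hypothetical stand-in for the per-file information a partition carries, and the default cap of three hosts and the tie-break on host name are assumptions of this sketch. The sort key is the spot where an extra criterion could be added.

```scala
// Hypothetical stand-in for the per-file data a partition holds: only the
// fields the ranking needs.
final case class FileSlice(path: String, length: Long, hosts: Seq[String])

object HostRankingSketch {
  /** Returns up to `maxHosts` hosts, ordered by how many bytes each one holds. */
  def rankHosts(files: Seq[FileSlice], maxHosts: Int = 3): Seq[String] = {
    // Sum the bytes of every file that reports a given host.
    val bytesPerHost: Map[String, Long] = files
      .flatMap(f => f.hosts.map(h => h -> f.length))
      .groupBy(_._1)
      .map { case (host, pairs) => host -> pairs.map(_._2).sum }

    bytesPerHost.toSeq
      // Largest byte count first; ties broken by host name (an assumption).
      // Further sort criteria would be added to this key.
      .sortBy { case (host, bytes) => (-bytes, host) }
      .take(maxHosts)
      .map(_._1)
  }
}
```

For example, with files of 100, 50 and 10 bytes stored on hosts (h1, h2), (h2, h3) and (h3) respectively, `rankHosts` returns h2, h1, h3, since the hosts hold 150, 100 and 60 bytes.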

-- 
Cheers,
-z
[1] 
https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43

On Thu, 4 Jun 2020 06:40:43 +
Nasrulla Khan Haris wrote:

> Hi Spark developers,
> 
> I have created a new format extending FileFormat. I see that
> getPreferredLocations is available if a new custom RDD is created. Since
> FileFormat is based on FileScanRDD, which uses the readfile method to read
> a partitioned file, is there a way to add the desired preferredLocations?
> 
> Appreciate your responses.
> 
> Thanks,
> NKH
> 



preferredlocations for hadoopfsrelations based baseRelations

2020-06-03 Thread Nasrulla Khan Haris
Hi Spark developers,

I have created a new format extending FileFormat. I see that
getPreferredLocations is available if a new custom RDD is created. Since
FileFormat is based on FileScanRDD, which uses the readfile method to read
a partitioned file, is there a way to add the desired preferredLocations?

Appreciate your responses.

Thanks,
NKH