kunwp1 commented on code in PR #3772:
URL: https://github.com/apache/texera/pull/3772#discussion_r2393010589


##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -102,6 +103,179 @@ object WorkflowExecutionsResource {
     }
   }
 
+  /**
+    * Represents a dataset that has access restrictions for export.
+    * Used to track which datasets are non-downloadable and owned by other users.
+    *
+    * @param ownerEmail The email of the dataset owner
+    * @param datasetName The name of the dataset
+    */
+  private case class RestrictedDataset(ownerEmail: String, datasetName: String) {
+    def cacheKey: (String, String) = (ownerEmail.toLowerCase, datasetName.toLowerCase)
+    def label: String = s"$datasetName ($ownerEmail)"
+  }
+
+  /**
+    * Parses a file path to extract dataset information.
+    * Expected format: /ownerEmail/datasetName/...
+    *
+    * @param path The file path from operator properties
+    * @return Some(RestrictedDataset) if path is valid, None otherwise
+    */
+  private def parseDatasetPath(path: String): Option[RestrictedDataset] = {
+    if (path == null) {
+      return None
+    }
+    val trimmed = path.trim
+    if (!trimmed.startsWith("/")) {
+      return None
+    }
+    val segments = trimmed.split("/").filter(_.nonEmpty)
+    if (segments.length < 4) {
+      return None
+    }
+    val ownerEmail = segments(0)
+    val datasetName = segments(1)
+    Some(RestrictedDataset(ownerEmail, datasetName))
+  }
+
+  /**
+    * Checks if a dataset is downloadable by querying the database.
+    * Uses caching to avoid repeated database queries for the same dataset.
+    *
+    * @param dataset The dataset to check
+    * @param cache A cache to store lookup results
+    * @return Some(true) if downloadable, Some(false) if not, None if dataset doesn't exist
+    */
+  private def lookupDatasetDownloadable(

Review Comment:
   Maybe rename it to `isDownloadableDataset`



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -102,6 +103,179 @@ object WorkflowExecutionsResource {
     }
   }
 
+  /**
+    * Represents a dataset that has access restrictions for export.
+    * Used to track which datasets are non-downloadable and owned by other users.
+    *
+    * @param ownerEmail The email of the dataset owner
+    * @param datasetName The name of the dataset
+    */
+  private case class RestrictedDataset(ownerEmail: String, datasetName: String) {

Review Comment:
   I prefer not to define this new class because the code only uses `cacheKey` and `label` once. I think it is better to just create those dynamically from `ownerEmail` and `datasetName` instead. Also, these labels are only used in the frontend, so the label should be built by the frontend, not the backend.



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -102,6 +103,179 @@ object WorkflowExecutionsResource {
     }
   }
 
+  /**
+    * Represents a dataset that has access restrictions for export.
+    * Used to track which datasets are non-downloadable and owned by other users.
+    *
+    * @param ownerEmail The email of the dataset owner
+    * @param datasetName The name of the dataset
+    */
+  private case class RestrictedDataset(ownerEmail: String, datasetName: String) {
+    def cacheKey: (String, String) = (ownerEmail.toLowerCase, datasetName.toLowerCase)
+    def label: String = s"$datasetName ($ownerEmail)"
+  }
+
+  /**
+    * Parses a file path to extract dataset information.
+    * Expected format: /ownerEmail/datasetName/...
+    *
+    * @param path The file path from operator properties
+    * @return Some(RestrictedDataset) if path is valid, None otherwise
+    */
+  private def parseDatasetPath(path: String): Option[RestrictedDataset] = {

Review Comment:
   I think this function should be moved to a file in the backend codebase that creates the dataset path with the expected format.



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -102,6 +103,179 @@ object WorkflowExecutionsResource {
     }
   }
 
+  /**
+    * Represents a dataset that has access restrictions for export.
+    * Used to track which datasets are non-downloadable and owned by other users.
+    *
+    * @param ownerEmail The email of the dataset owner
+    * @param datasetName The name of the dataset
+    */
+  private case class RestrictedDataset(ownerEmail: String, datasetName: String) {
+    def cacheKey: (String, String) = (ownerEmail.toLowerCase, datasetName.toLowerCase)

Review Comment:
   Why did you use `toLowerCase`?



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -102,6 +103,179 @@ object WorkflowExecutionsResource {
     }
   }
 
+  /**
+    * Represents a dataset that has access restrictions for export.
+    * Used to track which datasets are non-downloadable and owned by other users.
+    *
+    * @param ownerEmail The email of the dataset owner
+    * @param datasetName The name of the dataset
+    */
+  private case class RestrictedDataset(ownerEmail: String, datasetName: String) {
+    def cacheKey: (String, String) = (ownerEmail.toLowerCase, datasetName.toLowerCase)
+    def label: String = s"$datasetName ($ownerEmail)"
+  }
+
+  /**
+    * Parses a file path to extract dataset information.
+    * Expected format: /ownerEmail/datasetName/...
+    *
+    * @param path The file path from operator properties
+    * @return Some(RestrictedDataset) if path is valid, None otherwise
+    */
+  private def parseDatasetPath(path: String): Option[RestrictedDataset] = {
+    if (path == null) {
+      return None
+    }
+    val trimmed = path.trim
+    if (!trimmed.startsWith("/")) {
+      return None
+    }
+    val segments = trimmed.split("/").filter(_.nonEmpty)
+    if (segments.length < 4) {
+      return None
+    }
+    val ownerEmail = segments(0)
+    val datasetName = segments(1)
+    Some(RestrictedDataset(ownerEmail, datasetName))
+  }
+
+  /**
+    * Checks if a dataset is downloadable by querying the database.
+    * Uses caching to avoid repeated database queries for the same dataset.
+    *
+    * @param dataset The dataset to check
+    * @param cache A cache to store lookup results
+    * @return Some(true) if downloadable, Some(false) if not, None if dataset doesn't exist
+    */
+  private def lookupDatasetDownloadable(
+      dataset: RestrictedDataset,
+      cache: mutable.Map[(String, String), Option[Boolean]]
+  ): Option[Boolean] = {
+    cache.getOrElseUpdate(

Review Comment:
   What is the purpose of using cache? Could you provide an example where the same database query is executed multiple times? Also, is it possible to avoid repeating queries by combining them into a single query?



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -695,12 +869,33 @@ class WorkflowExecutionsResource {
       @Auth user: SessionUser
   ): Response = {
 
-    if (request.operators.size <= 0)
-      Response
+    if (request.operators.isEmpty) {

Review Comment:
   This change seems unnecessary. Can you revert it?



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -695,12 +869,33 @@ class WorkflowExecutionsResource {
       @Auth user: SessionUser
   ): Response = {
 
-    if (request.operators.size <= 0)
-      Response
+    if (request.operators.isEmpty) {
+      return Response
         .status(Response.Status.BAD_REQUEST)
         .`type`(MediaType.APPLICATION_JSON)
         .entity(Map("error" -> "No operator selected").asJava)
         .build()
+    }
+
+    // Get ALL restrictions in workflow

Review Comment:
   Rename restrictions to non-downloadable.



##########
core/gui/src/app/workspace/component/result-exportation/result-exportation.component.ts:
##########
@@ -106,13 +121,22 @@ export class ResultExportationComponent implements OnInit {
       return;
     }
 
+    if (this.isExportRestricted) {
+      this.isTableOutput = false;
+      this.isVisualizationOutput = false;
+      this.containsBinaryData = false;
+      return;
+    }
+
+    const idsToEvaluate = this.exportableOperatorIds;

Review Comment:
   I think the name `exportableOperatorIds` is already self-explanatory, so remove this line.



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -102,6 +103,179 @@ object WorkflowExecutionsResource {
     }
   }
 
+  /**
+    * Represents a dataset that has access restrictions for export.
+    * Used to track which datasets are non-downloadable and owned by other users.
+    *
+    * @param ownerEmail The email of the dataset owner
+    * @param datasetName The name of the dataset
+    */
+  private case class RestrictedDataset(ownerEmail: String, datasetName: String) {
+    def cacheKey: (String, String) = (ownerEmail.toLowerCase, datasetName.toLowerCase)
+    def label: String = s"$datasetName ($ownerEmail)"
+  }
+
+  /**
+    * Parses a file path to extract dataset information.
+    * Expected format: /ownerEmail/datasetName/...
+    *
+    * @param path The file path from operator properties
+    * @return Some(RestrictedDataset) if path is valid, None otherwise
+    */
+  private def parseDatasetPath(path: String): Option[RestrictedDataset] = {
+    if (path == null) {
+      return None
+    }
+    val trimmed = path.trim
+    if (!trimmed.startsWith("/")) {
+      return None
+    }
+    val segments = trimmed.split("/").filter(_.nonEmpty)
+    if (segments.length < 4) {
+      return None
+    }
+    val ownerEmail = segments(0)
+    val datasetName = segments(1)
+    Some(RestrictedDataset(ownerEmail, datasetName))
+  }
+
+  /**
+    * Checks if a dataset is downloadable by querying the database.
+    * Uses caching to avoid repeated database queries for the same dataset.
+    *
+    * @param dataset The dataset to check
+    * @param cache A cache to store lookup results
+    * @return Some(true) if downloadable, Some(false) if not, None if dataset doesn't exist
+    */
+  private def lookupDatasetDownloadable(
+      dataset: RestrictedDataset,
+      cache: mutable.Map[(String, String), Option[Boolean]]
+  ): Option[Boolean] = {
+    cache.getOrElseUpdate(
+      dataset.cacheKey, {
+        val record = context
+          .select(DATASET.IS_DOWNLOADABLE)
+          .from(DATASET)
+          .join(USER)
+          .on(DATASET.OWNER_UID.eq(USER.UID))
+          .where(
+            USER.EMAIL
+              .equalIgnoreCase(dataset.ownerEmail)
+              .and(DATASET.NAME.equalIgnoreCase(dataset.datasetName))
+          )
+          .fetchOne()
+        if (record == null) {
+          None
+        } else {
+          Option(record.value1())
+        }
+      }
+    )
+  }
+
+  /**
+    * Computes which operators in a workflow are restricted due to dataset access controls.
+    *
+    * This function:
+    * 1. Parses the workflow JSON to find all operators and their dataset dependencies
+    * 2. Identifies operators using non-downloadable datasets that the user doesn't own
+    * 3. Uses BFS to propagate restrictions through the workflow graph
+    * 4. Returns a map of operator IDs to the restricted datasets they depend on
+    *
+    * @param wid The workflow ID
+    * @param currentUser The current user making the export request
+    * @return Map of operator ID -> Set of restricted datasets that block its export
+    */
+  private def computeDatasetRestrictionMap(
+      wid: Int,
+      currentUser: UserPojo
+  ): Map[String, Set[RestrictedDataset]] = {
+    // Load workflow
+    val workflowRecord = context
+      .select(WORKFLOW.CONTENT)
+      .from(WORKFLOW)
+      .where(WORKFLOW.WID.eq(wid))
+      .fetchOne()
+
+    if (workflowRecord == null) {
+      return Map.empty
+    }
+
+    val content = workflowRecord.value1()
+    if (content == null || content.isEmpty) {

Review Comment:
   This if statement can be merged into the where clause of the query.



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -102,6 +103,179 @@ object WorkflowExecutionsResource {
     }
   }
 
+  /**
+    * Represents a dataset that has access restrictions for export.
+    * Used to track which datasets are non-downloadable and owned by other users.
+    *
+    * @param ownerEmail The email of the dataset owner
+    * @param datasetName The name of the dataset
+    */
+  private case class RestrictedDataset(ownerEmail: String, datasetName: String) {
+    def cacheKey: (String, String) = (ownerEmail.toLowerCase, datasetName.toLowerCase)
+    def label: String = s"$datasetName ($ownerEmail)"
+  }
+
+  /**
+    * Parses a file path to extract dataset information.
+    * Expected format: /ownerEmail/datasetName/...
+    *
+    * @param path The file path from operator properties
+    * @return Some(RestrictedDataset) if path is valid, None otherwise
+    */
+  private def parseDatasetPath(path: String): Option[RestrictedDataset] = {
+    if (path == null) {
+      return None
+    }
+    val trimmed = path.trim
+    if (!trimmed.startsWith("/")) {
+      return None
+    }
+    val segments = trimmed.split("/").filter(_.nonEmpty)
+    if (segments.length < 4) {
+      return None
+    }
+    val ownerEmail = segments(0)
+    val datasetName = segments(1)
+    Some(RestrictedDataset(ownerEmail, datasetName))
+  }
+
+  /**
+    * Checks if a dataset is downloadable by querying the database.
+    * Uses caching to avoid repeated database queries for the same dataset.
+    *
+    * @param dataset The dataset to check
+    * @param cache A cache to store lookup results
+    * @return Some(true) if downloadable, Some(false) if not, None if dataset doesn't exist
+    */
+  private def lookupDatasetDownloadable(
+      dataset: RestrictedDataset,
+      cache: mutable.Map[(String, String), Option[Boolean]]
+  ): Option[Boolean] = {
+    cache.getOrElseUpdate(
+      dataset.cacheKey, {
+        val record = context
+          .select(DATASET.IS_DOWNLOADABLE)
+          .from(DATASET)
+          .join(USER)
+          .on(DATASET.OWNER_UID.eq(USER.UID))
+          .where(
+            USER.EMAIL
+              .equalIgnoreCase(dataset.ownerEmail)
+              .and(DATASET.NAME.equalIgnoreCase(dataset.datasetName))
+          )
+          .fetchOne()
+        if (record == null) {
+          None
+        } else {
+          Option(record.value1())
+        }
+      }
+    )
+  }
+
+  /**
+    * Computes which operators in a workflow are restricted due to dataset access controls.
+    *
+    * This function:
+    * 1. Parses the workflow JSON to find all operators and their dataset dependencies
+    * 2. Identifies operators using non-downloadable datasets that the user doesn't own
+    * 3. Uses BFS to propagate restrictions through the workflow graph
+    * 4. Returns a map of operator IDs to the restricted datasets they depend on
+    *
+    * @param wid The workflow ID
+    * @param currentUser The current user making the export request
+    * @return Map of operator ID -> Set of restricted datasets that block its export
+    */
+  private def computeDatasetRestrictionMap(
+      wid: Int,
+      currentUser: UserPojo
+  ): Map[String, Set[RestrictedDataset]] = {
+    // Load workflow
+    val workflowRecord = context
+      .select(WORKFLOW.CONTENT)
+      .from(WORKFLOW)
+      .where(WORKFLOW.WID.eq(wid))
+      .fetchOne()
+
+    if (workflowRecord == null) {
+      return Map.empty
+    }
+
+    val content = workflowRecord.value1()
+    if (content == null || content.isEmpty) {
+      return Map.empty
+    }
+
+    val rootNode =
+      try {
+        objectMapper.readTree(content)
+      } catch {
+        case _: Exception => return Map.empty
+      }
+
+    val operatorsNode = rootNode.path("operators")
+    val linksNode = rootNode.path("links")
+
+    // Find source operators
+    val datasetStatusCache = mutable.Map.empty[(String, String), Option[Boolean]]
+    val restrictedSourceMap = mutable.Map.empty[String, RestrictedDataset]
+    val adjacency = mutable.Map.empty[String, mutable.ListBuffer[String]]
+
+    operatorsNode.elements().asScala.foreach { operatorNode =>

Review Comment:
   Try to think about this approach:
   1. Parse the JSON file to get list of datasets used in the workflow
   2. Run a single query to check whether each dataset is downloadable or not
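   As a hedged sketch of step 1 under the path format assumed in this PR (`/ownerEmail/datasetName/...`); the object and helper names are illustrative. The resulting distinct pairs could then feed one batched `IN`-style query instead of one lookup per operator:

   ```scala
   // Illustrative sketch: collect the distinct (ownerEmail, datasetName) pairs
   // referenced by operator file paths, so downloadability can be checked with
   // a single batched query rather than one query per operator.
   object DatasetPairSketch {
     def collectDatasetPairs(paths: Seq[String]): Set[(String, String)] =
       paths.flatMap { path =>
         Option(path).map(_.trim).filter(_.startsWith("/")).flatMap { trimmed =>
           val segments = trimmed.split("/").filter(_.nonEmpty)
           // A valid dataset path carries at least the owner email and dataset name.
           if (segments.length >= 2) Some((segments(0), segments(1))) else None
         }
       }.toSet
   }
   ```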



##########
core/gui/src/app/workspace/component/result-exportation/result-exportation.component.html:
##########
@@ -18,126 +18,144 @@
 -->
 
 <div class="centered-container">
-  <div class="input-wrapper">
-    <form nz-form>
-      <nz-row *ngIf="exportType !== 'data'">
-        <nz-col [nzSpan]="24">
-          <nz-form-item>
-            <nz-form-label nzFor="exportTypeInput">Export Type</nz-form-label>
-            <nz-form-control>
-              <nz-select
-                id="exportTypeInput"
-                [(ngModel)]="exportType"
-                name="exportType"
-                nzPlaceHolder="Select export type">
-                <nz-option
-                  *ngIf="isTableOutput"
-                  nzValue="arrow"
-                  nzLabel="Binary Format (.arrow)"></nz-option>
-                <nz-option
-                  *ngIf="isTableOutput && !containsBinaryData"
-                  nzValue="csv"
-                  nzLabel="Comma Separated Values (.csv)"></nz-option>
-                <nz-option
-                  *ngIf="isVisualizationOutput"
-                  nzValue="html"
-                  nzLabel="Web Page (.html)"></nz-option>
-                <nz-option
-                  nzValue="parquet"
-                  nzLabel="Parquet (.parquet)"></nz-option>
-              </nz-select>
-            </nz-form-control>
-          </nz-form-item>
-        </nz-col>
-      </nz-row>
-      <nz-row *ngIf="exportType === 'data'">
-        <nz-col [nzSpan]="24">
-          <nz-form-item>
-            <nz-form-label nzFor="filenameInput">Filename</nz-form-label>
-            <nz-form-control>
-              <input
-                id="filenameInput"
-                [(ngModel)]="inputFileName"
-                name="filename"
-                type="text"
-                nz-input
-                placeholder="Enter filename for binary data" />
-            </nz-form-control>
-          </nz-form-item>
-        </nz-col>
-      </nz-row>
-      <nz-row>
-        <nz-col [nzSpan]="24">
-          <nz-form-item>
-            <nz-form-label nzFor="destinationInput">Destination</nz-form-label>
-            <nz-form-control>
-              <nz-select
-                id="destinationInput"
-                [(ngModel)]="destination"
-                name="destination"
-                nzPlaceHolder="Select destination">
-                <nz-option
-                  nzValue="dataset"
-                  nzLabel="Dataset"></nz-option>
-                <nz-option
-                  nzValue="local"
-                  nzLabel="Local"></nz-option>
-              </nz-select>
-            </nz-form-control>
-          </nz-form-item>
-        </nz-col>
-      </nz-row>
-    </form>
-  </div>
   <div
     class="input-wrapper"
-    *ngIf="destination === 'dataset'">
-    <input
-      [(ngModel)]="inputDatasetName"
-      (input)="onUserInputDatasetName($event)"
-      type="text"
-      nz-input
-      name="datasetName"
-      placeholder="Search for dataset by name..."
-      [nzAutocomplete]="auto" />
-    <nz-autocomplete #auto>
-      <nz-auto-option
-        *ngFor="let dataset of filteredUserAccessibleDatasets"
-        [nzLabel]="dataset.dataset.name">
-        <div class="auto-option-content">
-          <div class="dataset-id-container">{{dataset.dataset.did?.toString()}}</div>
+    *ngIf="hasPartialRestriction && blockingDatasetLabels.length > 0">

Review Comment:
   ditto



##########
core/gui/src/app/workspace/component/result-exportation/result-exportation.component.ts:
##########
@@ -146,6 +170,9 @@ export class ResultExportationComponent implements OnInit {
   }
 
  onClickExportResult(destination: "dataset" | "local", dataset: DashboardDataset = {} as DashboardDataset) {
+    if (this.isExportRestricted) {

Review Comment:
   This looks like dead code. Please remove it.



##########
core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala:
##########
@@ -102,6 +103,179 @@ object WorkflowExecutionsResource {
     }
   }
 
+  /**
+    * Represents a dataset that has access restrictions for export.
+    * Used to track which datasets are non-downloadable and owned by other users.
+    *
+    * @param ownerEmail The email of the dataset owner
+    * @param datasetName The name of the dataset
+    */
+  private case class RestrictedDataset(ownerEmail: String, datasetName: String) {
+    def cacheKey: (String, String) = (ownerEmail.toLowerCase, datasetName.toLowerCase)
+    def label: String = s"$datasetName ($ownerEmail)"
+  }
+
+  /**
+    * Parses a file path to extract dataset information.
+    * Expected format: /ownerEmail/datasetName/...
+    *
+    * @param path The file path from operator properties
+    * @return Some(RestrictedDataset) if path is valid, None otherwise
+    */
+  private def parseDatasetPath(path: String): Option[RestrictedDataset] = {
+    if (path == null) {
+      return None
+    }
+    val trimmed = path.trim
+    if (!trimmed.startsWith("/")) {
+      return None
+    }
+    val segments = trimmed.split("/").filter(_.nonEmpty)
+    if (segments.length < 4) {
+      return None
+    }
+    val ownerEmail = segments(0)
+    val datasetName = segments(1)
+    Some(RestrictedDataset(ownerEmail, datasetName))
+  }
+
+  /**
+    * Checks if a dataset is downloadable by querying the database.
+    * Uses caching to avoid repeated database queries for the same dataset.
+    *
+    * @param dataset The dataset to check
+    * @param cache A cache to store lookup results
+    * @return Some(true) if downloadable, Some(false) if not, None if dataset doesn't exist
+    */
+  private def lookupDatasetDownloadable(
+      dataset: RestrictedDataset,
+      cache: mutable.Map[(String, String), Option[Boolean]]
+  ): Option[Boolean] = {
+    cache.getOrElseUpdate(
+      dataset.cacheKey, {
+        val record = context
+          .select(DATASET.IS_DOWNLOADABLE)
+          .from(DATASET)
+          .join(USER)
+          .on(DATASET.OWNER_UID.eq(USER.UID))
+          .where(
+            USER.EMAIL
+              .equalIgnoreCase(dataset.ownerEmail)
+              .and(DATASET.NAME.equalIgnoreCase(dataset.datasetName))
+          )
+          .fetchOne()
+        if (record == null) {
+          None
+        } else {
+          Option(record.value1())
+        }
+      }
+    )
+  }
+
+  /**
+    * Computes which operators in a workflow are restricted due to dataset access controls.
+    *
+    * This function:
+    * 1. Parses the workflow JSON to find all operators and their dataset dependencies
+    * 2. Identifies operators using non-downloadable datasets that the user doesn't own
+    * 3. Uses BFS to propagate restrictions through the workflow graph
+    * 4. Returns a map of operator IDs to the restricted datasets they depend on
+    *
+    * @param wid The workflow ID
+    * @param currentUser The current user making the export request
+    * @return Map of operator ID -> Set of restricted datasets that block its export
+    */
+  private def computeDatasetRestrictionMap(

Review Comment:
   Maybe rename it to `getNonDownloadableOperatorMap`. The word restriction doesn't sound good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to