[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-11-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629925#comment-17629925
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

shangxinli merged PR #998:
URL: https://github.com/apache/parquet-mr/pull/998




> Add scan command to parquet-cli
> ---
>
> Key: PARQUET-2195
> URL: https://issues.apache.org/jira/browse/PARQUET-2195
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: Gang Wu
>Priority: Major
>
> parquet-cli has *cat* and *head* commands to print the records but it does 
> not have the capability to *scan* (w/o printing) all records to check if the 
> file is corrupted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-11-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629921#comment-17629921
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

shangxinli commented on PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#issuecomment-1305941081

   lgtm






[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-10-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615893#comment-17615893
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

wgtmac commented on code in PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#discussion_r992446760


##
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ScanCommand.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import com.google.common.io.Closeables;
+import org.apache.avro.Schema;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.cli.util.Expressions;
+import org.slf4j.Logger;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+
+@Parameters(commandDescription = "Scan all records from a file")
+public class ScanCommand extends BaseCommand {
+
+  @Parameter(description = "<file>")
+  List<String> sourceFiles;
+
+  @Parameter(
+    names = {"-c", "--column", "--columns"},
+    description = "List of columns")
+  List<String> columns;
+
+  public ScanCommand(Logger console) {
+    super(console);
+  }
+
+  @Override
+  public int run() throws IOException {
+    Preconditions.checkArgument(
+      sourceFiles != null && !sourceFiles.isEmpty(),
+      "Missing file name");
+    Preconditions.checkArgument(sourceFiles.size() == 1,
+      "Only one file can be given");

Review Comment:
   Fixed







[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-10-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615889#comment-17615889
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

wgtmac commented on code in PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#discussion_r992437914


##
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ScanCommand.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import com.google.common.io.Closeables;
+import org.apache.avro.Schema;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.cli.util.Expressions;
+import org.slf4j.Logger;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+
+@Parameters(commandDescription = "Scan all records from a file")
+public class ScanCommand extends BaseCommand {
+
+  @Parameter(description = "<file>")
+  List<String> sourceFiles;
+
+  @Parameter(
+    names = {"-c", "--column", "--columns"},
+    description = "List of columns")
+  List<String> columns;
+
+  public ScanCommand(Logger console) {
+    super(console);
+  }
+
+  @Override
+  public int run() throws IOException {
+    Preconditions.checkArgument(
+      sourceFiles != null && !sourceFiles.isEmpty(),
+      "Missing file name");
+    Preconditions.checkArgument(sourceFiles.size() == 1,
+      "Only one file can be given");
+
+    final String source = sourceFiles.get(0);
+    Schema schema = getAvroSchema(source);
+    Schema projection = Expressions.filterSchema(schema, columns);
+
+    long startTime = System.currentTimeMillis();
+    Iterable<Object> reader = openDataFile(source, projection);
+    boolean threw = true;
+    long count = 0;
+    try {
+      for (Object record : reader) {
+        count += 1;

Review Comment:
   This is very handy when we want to quickly check if any file is corrupted in 
production.







[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-10-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615885#comment-17615885
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

wgtmac commented on code in PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#discussion_r992436297


##
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ScanCommand.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import com.google.common.io.Closeables;
+import org.apache.avro.Schema;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.cli.util.Expressions;
+import org.slf4j.Logger;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+
+@Parameters(commandDescription = "Scan all records from a file")
+public class ScanCommand extends BaseCommand {
+
+  @Parameter(description = "<file>")
+  List<String> sourceFiles;
+
+  @Parameter(
+    names = {"-c", "--column", "--columns"},
+    description = "List of columns")
+  List<String> columns;
+
+  public ScanCommand(Logger console) {
+    super(console);
+  }
+
+  @Override
+  public int run() throws IOException {
+    Preconditions.checkArgument(
+      sourceFiles != null && !sourceFiles.isEmpty(),
+      "Missing file name");
+    Preconditions.checkArgument(sourceFiles.size() == 1,
+      "Only one file can be given");
+
+    final String source = sourceFiles.get(0);
+    Schema schema = getAvroSchema(source);
+    Schema projection = Expressions.filterSchema(schema, columns);
+
+    long startTime = System.currentTimeMillis();
+    Iterable<Object> reader = openDataFile(source, projection);
+    boolean threw = true;
+    long count = 0;
+    try {
+      for (Object record : reader) {
+        count += 1;

Review Comment:
   It serves the same purpose as 
https://github.com/apache/arrow/blob/master/cpp/tools/parquet/parquet_scan.cc 
to validate the data integrity of a parquet file.
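
   To illustrate why a full scan catches problems that a metadata-only row count would miss, here is a minimal toy sketch. It does not use the Parquet format or any parquet-mr API: `ToyFile`, its footer count, and the sum-based page checksum are all hypothetical stand-ins for a file footer and for the decoding work a real reader performs on each page.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only (NOT the Parquet format): a "file" with a footer row count
// and data pages whose last slot holds a checksum (the sum of the values).
final class IntegritySketch {

  static final class ToyFile {
    final long footerRowCount; // what a metadata-only count would report
    final List<int[]> pages;   // each page: values..., checksum

    ToyFile(long footerRowCount, List<int[]> pages) {
      this.footerRowCount = footerRowCount;
      this.pages = pages;
    }
  }

  // Cheap: reads only the footer, so it never touches the data pages.
  static long countFromMetadata(ToyFile file) {
    return file.footerRowCount;
  }

  // Expensive: visits every value, and fails on a page that does not verify.
  static long countByScanning(ToyFile file) {
    long count = 0;
    for (int[] page : file.pages) {
      int sum = 0;
      for (int i = 0; i < page.length - 1; i++) {
        sum += page[i];
      }
      if (sum != page[page.length - 1]) {
        throw new IllegalStateException("Corrupted page detected");
      }
      count += page.length - 1;
    }
    return count;
  }

  public static void main(String[] args) {
    ToyFile healthy = new ToyFile(5,
        Arrays.asList(new int[] {1, 2, 3, 6}, new int[] {4, 5, 9}));
    // Second page has a flipped value, but the footer is untouched.
    ToyFile corrupted = new ToyFile(5,
        Arrays.asList(new int[] {1, 2, 3, 6}, new int[] {4, 7, 9}));

    System.out.println("healthy, metadata:   " + countFromMetadata(healthy));
    System.out.println("healthy, scan:       " + countByScanning(healthy));
    System.out.println("corrupted, metadata: " + countFromMetadata(corrupted));
    try {
      countByScanning(corrupted);
    } catch (IllegalStateException e) {
      System.out.println("corrupted, scan:     " + e.getMessage());
    }
  }
}
```

   In this toy model the metadata count is identical for both files, and only the scan notices the damaged page. That is the trade-off discussed in this thread: the scan is slower, but it exercises the data itself.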







[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-10-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615881#comment-17615881
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

wgtmac commented on code in PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#discussion_r992431971


##
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ScanCommand.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import com.google.common.io.Closeables;
+import org.apache.avro.Schema;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.cli.util.Expressions;
+import org.slf4j.Logger;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+
+@Parameters(commandDescription = "Scan all records from a file")
+public class ScanCommand extends BaseCommand {
+
+  @Parameter(description = "<file>")
+  List<String> sourceFiles;
+
+  @Parameter(
+    names = {"-c", "--column", "--columns"},
+    description = "List of columns")
+  List<String> columns;
+
+  public ScanCommand(Logger console) {
+    super(console);
+  }
+
+  @Override
+  public int run() throws IOException {
+    Preconditions.checkArgument(
+      sourceFiles != null && !sourceFiles.isEmpty(),
+      "Missing file name");
+    Preconditions.checkArgument(sourceFiles.size() == 1,
+      "Only one file can be given");
+
+    final String source = sourceFiles.get(0);
+    Schema schema = getAvroSchema(source);
+    Schema projection = Expressions.filterSchema(schema, columns);

Review Comment:
   I believe the method name is a little confusing. It supports getting the 
schema from Parquet, Avro, and avsc files. 
   
   Please check here for details: 
https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/BaseCommand.java#L397







[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-10-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614798#comment-17614798
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

shangxinli commented on code in PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#discussion_r990827325


##
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ScanCommand.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import com.google.common.io.Closeables;
+import org.apache.avro.Schema;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.cli.util.Expressions;
+import org.slf4j.Logger;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+
+@Parameters(commandDescription = "Scan all records from a file")
+public class ScanCommand extends BaseCommand {
+
+  @Parameter(description = "<file>")
+  List<String> sourceFiles;
+
+  @Parameter(
+    names = {"-c", "--column", "--columns"},
+    description = "List of columns")
+  List<String> columns;
+
+  public ScanCommand(Logger console) {
+    super(console);
+  }
+
+  @Override
+  public int run() throws IOException {
+    Preconditions.checkArgument(
+      sourceFiles != null && !sourceFiles.isEmpty(),
+      "Missing file name");
+    Preconditions.checkArgument(sourceFiles.size() == 1,
+      "Only one file can be given");
+
+    final String source = sourceFiles.get(0);
+    Schema schema = getAvroSchema(source);
+    Schema projection = Expressions.filterSchema(schema, columns);
+
+    long startTime = System.currentTimeMillis();
+    Iterable<Object> reader = openDataFile(source, projection);
+    boolean threw = true;
+    long count = 0;
+    try {
+      for (Object record : reader) {
+        count += 1;

Review Comment:
   If your goal is only to get the count, why not get it from the metadata? 
Iterating over each record is an expensive operation. 







[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-10-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614797#comment-17614797
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

shangxinli commented on code in PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#discussion_r990827069


##
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ScanCommand.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import com.google.common.io.Closeables;
+import org.apache.avro.Schema;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.cli.util.Expressions;
+import org.slf4j.Logger;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+
+@Parameters(commandDescription = "Scan all records from a file")
+public class ScanCommand extends BaseCommand {
+
+  @Parameter(description = "<file>")
+  List<String> sourceFiles;
+
+  @Parameter(
+    names = {"-c", "--column", "--columns"},
+    description = "List of columns")
+  List<String> columns;
+
+  public ScanCommand(Logger console) {
+    super(console);
+  }
+
+  @Override
+  public int run() throws IOException {
+    Preconditions.checkArgument(
+      sourceFiles != null && !sourceFiles.isEmpty(),
+      "Missing file name");
+    Preconditions.checkArgument(sourceFiles.size() == 1,
+      "Only one file can be given");
+
+    final String source = sourceFiles.get(0);
+    Schema schema = getAvroSchema(source);
+    Schema projection = Expressions.filterSchema(schema, columns);

Review Comment:
   What do we do if the file doesn't have an Avro schema?









[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-10-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614795#comment-17614795
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

shangxinli commented on code in PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#discussion_r990826841


##
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ScanCommand.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import com.google.common.io.Closeables;
+import org.apache.avro.Schema;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.cli.util.Expressions;
+import org.slf4j.Logger;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+
+@Parameters(commandDescription = "Scan all records from a file")
+public class ScanCommand extends BaseCommand {
+
+  @Parameter(description = "<file>")
+  List<String> sourceFiles;
+
+  @Parameter(
+    names = {"-c", "--column", "--columns"},
+    description = "List of columns")
+  List<String> columns;
+
+  public ScanCommand(Logger console) {
+    super(console);
+  }
+
+  @Override
+  public int run() throws IOException {
+    Preconditions.checkArgument(
+      sourceFiles != null && !sourceFiles.isEmpty(),
+      "Missing file name");
+    Preconditions.checkArgument(sourceFiles.size() == 1,
+      "Only one file can be given");

Review Comment:
   Why not define it as String instead of List? 







[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-09-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608560#comment-17608560
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

wgtmac commented on PR #998:
URL: https://github.com/apache/parquet-mr/pull/998#issuecomment-1255836576

   @shangxinli Please take a look when you have time. Thanks!






[jira] [Commented] (PARQUET-2195) Add scan command to parquet-cli

2022-09-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608558#comment-17608558
 ] 

ASF GitHub Bot commented on PARQUET-2195:
-

wgtmac opened a new pull request, #998:
URL: https://github.com/apache/parquet-mr/pull/998

   This PR enhances parquet-cli by adding a scan command that goes through all 
records without printing them. This is useful when users need to verify whether 
a Parquet file is corrupted.
   
   No additional unit tests are added. Tested manually with local Parquet 
files.
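
   As a rough illustration of the scan idea described above, the following self-contained sketch reads every record without printing it, counts the records, and reports the elapsed time. It is not the code from this PR: `ListReader` is a hypothetical in-memory stand-in for the reader that parquet-cli would open on a real file, and the sketch uses only the Java standard library.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: a stand-in for a record reader, not the parquet-cli API.
final class ScanSketch {

  /** Hypothetical record source that can be iterated and closed. */
  static final class ListReader implements Iterable<Object>, Closeable {
    private final List<Object> records;
    boolean closed = false;

    ListReader(List<Object> records) {
      this.records = records;
    }

    @Override
    public Iterator<Object> iterator() {
      return records.iterator();
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Touches every record without printing it; returns how many were read. */
  static long scan(Iterable<Object> reader) {
    long count = 0;
    try {
      for (Object record : reader) {
        count += 1; // reading the record is the whole point; nothing is printed
      }
    } finally {
      // Release the underlying resource whether or not the scan succeeded.
      if (reader instanceof Closeable) {
        try {
          ((Closeable) reader).close();
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    long startTime = System.currentTimeMillis();
    long count = scan(new ListReader(Arrays.<Object>asList("a", "b", "c")));
    long elapsedMs = System.currentTimeMillis() - startTime;
    System.out.println("Scanned " + count + " records in " + elapsedMs + " ms");
  }
}
```

   With a real reader, any exception thrown while decoding a record would propagate out of the loop, which is how a scan of this kind would surface a corrupted file.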



